Next-Gen CAPTCHAs: GUI-Agent Defense

Introducing Next-Gen CAPTCHAs

The rapid evolution of GUI-enabled agents has rendered traditional CAPTCHAs obsolete. While previous benchmarks established a baseline for evaluating multimodal agents, recent reasoning-heavy models have effectively collapsed this security barrier, with pass rates as high as 90% on complex logic puzzles.

We introduce Next-Gen CAPTCHAs, a scalable defense framework comprising 27 newly designed CAPTCHA families for the GUI-agent era, built to secure the next-generation web against advanced agents. We exploit the persistent human-agent "Cognitive Gap" in interactive perception, memory, decision-making, and action. By engineering dynamic tasks that require adaptive intuition rather than granular planning, we re-establish a robust distinction between biological users and artificial agents.

Key Results

Performance comparison across frontier AI models on our benchmark^†

98.8%  Human
5.9%  GPT-5.2-xHigh, OpenAI (xHigh Thinking Mode, Max level)
3.2%  Gemini-3-Flash, Google (High Thinking Mode)
3.0%  Claude-Opus-4.5, Anthropic (Extended High Thinking Mode)
1.3%  Gemini-3-Pro, Google (High Thinking Mode)
1.3%  Doubao-Seed-1.8, ByteDance Doubao (High Thinking Mode)
0.9%  Qwen3-VL-Plus, Alibaba (High Thinking Mode)

^† The benchmark contains 519 puzzles in total; due to inference cost constraints, GPT-5.2-xHigh and Claude-Opus-4.5 were evaluated on a 135-puzzle subset.
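Because two of the models were scored on only 135 puzzles, their pass rates carry noticeable sampling uncertainty. The sketch below illustrates this with a Wilson score interval. It assumes that a 5.9% pass rate on 135 puzzles corresponds to roughly 8 solved puzzles (0.059 × 135 ≈ 8); that count is our inference for illustration, not a figure reported in the source, and the function name is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# ASSUMPTION: 8 successes is inferred from 5.9% of 135, not taken from the source.
low, high = wilson_interval(successes=8, n=135)
print(f"approx. 95% interval: {low:.1%} to {high:.1%}")
```

Even the upper end of such an interval stays far below the 98.8% human pass rate, so the subset size does not change the qualitative conclusion.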

Economic Asymmetry

The Browse-Use Agent backed by GPT-5.2-xHigh spent $6.02 per puzzle to achieve a success rate of only 5.9%, while humans solve each puzzle in about 31 seconds at no cost.
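To make the asymmetry concrete, the per-puzzle cost can be converted into an expected cost per solved puzzle. The sketch below assumes each attempt is independent and costs the same $6.02, so the expected number of attempts per success is 1 / success rate; the function name is hypothetical and the independence assumption is ours, not the source's.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per solved puzzle, assuming independent attempts."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Figures quoted above: $6.02 per puzzle at a 5.9% success rate.
expected_cost = cost_per_success(6.02, 0.059)
expected_attempts = 1 / 0.059
print(f"~{expected_attempts:.0f} attempts, ~${expected_cost:.2f} per solved puzzle")
```

Under these assumptions an attacker would pay on the order of $100 for each puzzle actually solved.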

Time Barrier

High-reasoning models require 16 to 77 minutes per puzzle, versus sub-minute performance for humans.
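The gap can be expressed as a slowdown factor. The sketch below uses the 31-second human solve time quoted in the Economic Asymmetry note as the baseline; treating that single figure as the baseline for every model is a simplifying assumption, and the function name is hypothetical.

```python
HUMAN_SECONDS = 31  # human solve time per puzzle quoted on this page


def slowdown(agent_minutes: float, human_seconds: float = HUMAN_SECONDS) -> float:
    """How many times slower the agent is than the human baseline."""
    return agent_minutes * 60 / human_seconds


fastest, slowest = slowdown(16), slowdown(77)
print(f"agents are roughly {fastest:.0f}x to {slowest:.0f}x slower than humans")
```

With these figures the models are slower than a human by about one and a half to two orders of magnitude.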

Cognitive Gap Categories

Our CAPTCHAs target 5 fundamental human-agent gaps

Scene-Structure Inference

Observation interpretation and grounding under partial observability

Mirror, Shadow Direction, 3D Viewpoint, Backmost Layer

Temporal Integration

Multi-step evidence accumulation from motion and sequential reveals

Spooky Circle, Structure From Motion, Trajectory Recovery

Numerosity & Invariants

Decision-boundary sensitivity to discrete quantities and counts

Hole Counting, Color Counting, Subway Paths

Latent-State Tracking

Working-memory consistency across interaction steps

Dice Roll Path, Box Folding, Temporal Object Continuity

Perception-to-Action

Robust low-level execution of correct browser interactions

Static Jigsaw, Dynamic Jigsaw, Red Dot

Citation

@misc{liu2026nextgencaptchasleveragingcognitive,
  title={Next-Gen CAPTCHAs: Leveraging the Cognitive Gap for Scalable and Diverse GUI-Agent Defense},
  author={Jiacheng Liu and Yaxin Luo and Jiacheng Cui and Xinyi Shang and Xiaohan Zhao and Zhiqiang Shen},
  year={2026},
  eprint={2602.09012},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.09012},
}
