Existing web agent evaluation suffers from three key limitations. (1) Insufficient benchmark coverage: offline benchmarks lack real-world fidelity, while online benchmarks remain limited in website scale, domain diversity, and intent variety, leading to biased and overly optimistic assessments. (2) Unscalable evaluation methods: human annotation is costly and impractical at scale, and current automated approaches lack sufficient accuracy on complex queries. (3) Narrow evaluation dimensions: existing protocols fail to systematically assess agents' ability to leverage external knowledge and execute the end-to-end workflows required for real-world deployment. WebRetriever addresses all three gaps through large-scale diverse benchmarking, a reliable automated evaluation method, and deployment-oriented evaluation protocols.
(1) A large-scale, comprehensive benchmark for realistic web agent evaluation. We curate 1,550 tasks across 800 real websites spanning diverse domains and user intents. Compared with prior benchmarks, WebRetriever provides unprecedented scale, diversity, and coverage, enabling more comprehensive and representative evaluation of web agents in realistic online environments.
(2) A general and high-precision automated evaluation method. We propose NavEval, an automated evaluation method that reaches approximately 90% agreement with human judgments in large-scale experiments, thereby enabling practical and reliable assessment of web agent performance at scale and in real time.
(3) A comprehensive evaluation framework. We propose three complementary evaluation protocols to systematically assess web agents, explicitly disentangling navigation success from answer correctness and characterizing behavioral reliability under injected operational knowledge, thereby providing diagnostic signals missing from prior benchmarks.
Figure: (a) Website Domain Distribution; (b) Website Type Distribution; (c) General Intent Distribution; (d) Professional Intent Distribution.
Current LLM-based evaluation pipelines exhibit limitations in organizing and utilizing evaluation signals. Existing methods employ three screenshot strategies: (1) full-trajectory screenshots, which provide comprehensive coverage but suffer from redundancy and high computational costs; (2) final screenshot only, which captures task completion status but misses critical intermediate process information; and (3) LLM-selected key screenshots, which reduce redundancy but lack sufficient accuracy in key step selection and matching. These limitations reveal the core challenge of automated evaluation: how to extract critical information to enhance LLM judgment accuracy. To address these limitations, we propose NavEval, which integrates more diverse information from agent-browser interactions. Beyond actions and final screenshots, NavEval leverages all navigation URLs and network requests generated during interactions. Through rule-based filtering to eliminate substantial noise, these signals are restructured into organized information that enhances LLM judgment accuracy.
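The signal organization described above can be sketched as follows. This is a minimal illustration, not NavEval's implementation: the source does not specify the filtering rules, so the rules here (dropping static assets, dropping hosts that look like analytics or tracking endpoints, and removing duplicates) are assumptions, and the names `NetworkRequest`, `is_noise`, `filter_requests`, and `build_judge_context` are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical noise rules; the rules actually used by NavEval are not given in the source.
STATIC_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js", ".woff", ".woff2", ".ico")
NOISE_HOST_KEYWORDS = ("analytics", "doubleclick", "tracking")


@dataclass(frozen=True)
class NetworkRequest:
    method: str
    url: str


def is_noise(req: NetworkRequest) -> bool:
    """Return True for requests that carry no task-relevant signal under the assumed rules."""
    parsed = urlparse(req.url)
    if parsed.path.lower().endswith(STATIC_SUFFIXES):
        return True
    host = parsed.netloc.lower()
    return any(keyword in host for keyword in NOISE_HOST_KEYWORDS)


def filter_requests(requests):
    """Drop noisy and duplicate requests while preserving the original order."""
    seen = set()
    kept = []
    for req in requests:
        key = (req.method, req.url)
        if is_noise(req) or key in seen:
            continue
        seen.add(key)
        kept.append(req)
    return kept


def build_judge_context(task, actions, nav_urls, requests):
    """Assemble the textual signals into one organized block for an LLM judge.

    The final screenshot would be passed to the judge separately; it is omitted here.
    """
    lines = [f"Task: {task}", "Actions:"]
    lines += [f"  {i}. {action}" for i, action in enumerate(actions, 1)]
    lines.append("Navigation URLs:")
    lines += [f"  {url}" for url in nav_urls]
    lines.append("Filtered network requests:")
    lines += [f"  {r.method} {r.url}" for r in filter_requests(requests)]
    return "\n".join(lines)
```

Keeping the filter as plain, ordered rules makes it cheap to run on every trajectory and easy to audit when the judge disagrees with a human annotator.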
To comprehensively evaluate web agent capabilities, we design three protocols reflecting realistic deployment scenarios. Protocol I assesses fundamental navigation ability by measuring whether agents can reach target pages given user tasks. Protocol II evaluates navigation performance when agents are provided with operational documentation. For Protocol II, operational knowledge is consolidated into structured documentation through a closed-loop framework. Web agents first generate action trajectories for target tasks, which NavEval evaluates automatically. Correct trajectories proceed to annotation, while incorrect ones are regenerated until reaching retry limits, after which human annotators refine them. The annotation platform further verifies accepted trajectories and streamlines redundant operations. Finally, LLMs generate operation manuals from these refined trajectories, completing the documentation pipeline. While Protocols I and II focus on navigation, realistic web environments require capabilities beyond page navigation. Agents must demonstrate deep page understanding, multi-source information integration, and precise extraction capabilities critical for holistic evaluation. Protocol III mirrors realistic deployment by assessing whether agents can accurately retrieve target information across multiple modalities (text, documents, charts). Based on these designs, we collect 1,000, 1,000, and 100 tasks for Protocols I, II, and III, respectively. Notably, Protocols I and II share 550 overlapping tasks, with the sole difference being the availability of operational documentation.
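The closed-loop trajectory collection described for Protocol II can be sketched as a generate, evaluate, retry loop with a human fallback. This is only an illustration under assumptions: the function name `collect_trajectory`, the callables `generate`, `evaluate`, and `human_refine`, and the default retry limit of 3 are all hypothetical, since the source mentions retry limits without giving a value. The later verification and manual-generation steps are not modeled.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]


def collect_trajectory(
    task: str,
    generate: Callable[[str], Trajectory],
    evaluate: Callable[[str, Trajectory], bool],
    human_refine: Callable[[str, Trajectory], Trajectory],
    max_retries: int = 3,  # assumed value; the source does not state the limit
) -> Tuple[Trajectory, str]:
    """Return a trajectory ready for annotation and a tag saying how it was obtained."""
    last: Trajectory = []
    for _ in range(max_retries):
        last = generate(task)       # agent produces an action trajectory
        if evaluate(task, last):    # automated judgment (NavEval in the text)
            return last, "auto"
    # Retry limit reached: hand the last attempt to a human annotator.
    return human_refine(task, last), "human"
```

Passing the three stages in as callables keeps the loop independent of any particular agent, judge, or annotation tool.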
Important Notice:
Results not listed on this leaderboard have not been verified by the WebRetriever Team, and their accuracy is not guaranteed.
(1) Protocol I — Evaluates basic navigation ability to reach target pages.
(2) Protocol II — Assesses navigation performance when provided with operational knowledge.
(3) Protocol III — Measures end-to-end task completion by jointly evaluating navigation and information extraction, avoiding the limitation of equating page arrival with task success.
For Protocols I and II, we maintain both human-verified success rates and NavEval automated success rates. For Protocol III, since it evaluates end-to-end task completion (including information extraction), only human-verified success rates are reported.
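As a rough illustration of how the two success rates for Protocols I and II relate, the sketch below computes a success rate from per-task pass/fail labels and a simple per-task agreement rate between human and automated labels. The function names are hypothetical, and treating agreement as the fraction of tasks with matching labels is an assumption; the source does not define the exact agreement metric.

```python
from typing import Sequence


def success_rate(labels: Sequence[bool]) -> float:
    """Fraction of tasks judged successful."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(labels) / len(labels)


def agreement_rate(human: Sequence[bool], auto: Sequence[bool]) -> float:
    """Fraction of tasks where the automated label matches the human label."""
    if len(human) != len(auto):
        raise ValueError("label sequences must have the same length")
    if not human:
        raise ValueError("labels must not be empty")
    matches = sum(h == a for h, a in zip(human, auto))
    return matches / len(human)


# Toy labels for illustration only; these are not benchmark results.
human_labels = [True, True, False, False]
auto_labels = [True, False, False, False]
print(success_rate(human_labels), success_rate(auto_labels))
print(agreement_rate(human_labels, auto_labels))
```

Two evaluators can report similar aggregate success rates while disagreeing on individual tasks, which is why per-task agreement is worth tracking alongside the rates themselves.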
Note: The Protocol III evaluation set has been expanded from the original 50 tasks (as reported in the paper) to 100 tasks for more comprehensive assessment. As a result, Protocol III scores on this leaderboard may differ from those reported in the paper.
In our implementation, NavEval's judgment module adopts Claude-4.5-Sonnet to ensure stable semantic understanding and reasoning.
To facilitate transparent and reproducible evaluation, we will publicly release the complete evaluation framework, including implementations of all three evaluation protocols, the NavEval automated evaluation pipeline, and human evaluation guidelines.
@article{webretriever-2026,
title={WebRetriever: A Large-Scale Comprehensive Benchmark for Efficient Web Agent Evaluation},
author={Wei Dong and Tianyu Fu and Zhe Yu and Hanning Wang and Anyang Su and Zhizhou Fang and Yuyang Chen and Shuo Wang and Minghui Wu and Ping Jiang and Zhen Lei and Chenxu Zhao},
year={2026},
note={To be published},
url={https://github.com/hhhhhhalf/WebRetriever}
}