💡 Motivation

Existing web agent evaluation suffers from three key limitations: (1) insufficient benchmark coverage: offline benchmarks lack real-world fidelity, while online benchmarks remain limited in website scale, domain diversity, and intent variety, leading to biased and overly optimistic assessments; (2) unscalable evaluation methods: human annotation is costly and impractical at scale, and current automated approaches lack sufficient accuracy on complex queries; (3) narrow evaluation dimensions: existing protocols fail to systematically assess agents' ability to leverage external knowledge and execute the end-to-end workflows required for real-world deployment. WebRetriever addresses all three gaps through large-scale diverse benchmarking, a reliable automated evaluation method, and deployment-oriented evaluation protocols.

Figure: Motivation for the WebRetriever benchmark. WebRetriever addresses key limitations of prior work from three aspects: dataset scale and diversity, automated evaluation reliability, and deployment-oriented evaluation protocols.

📄 Abstract

As web agents increasingly demonstrate capabilities in automated task execution, the development of robust evaluation frameworks for assessing their navigation and task completion performance has emerged as a critical research priority. However, existing benchmarks exhibit several fundamental limitations. First, they suffer from insufficient scale and limited domain diversity, thereby constraining comprehensive evaluation of cross-domain generalization. Second, prevailing LLM-as-Judge evaluation methodologies inadequately capture fine-grained interaction semantics, particularly regarding precise query formulation and filtering operations. Third, current benchmarks predominantly emphasize navigation success metrics while neglecting critical requirements for real-world deployment scenarios. To address these limitations, we introduce WebRetriever, a large-scale benchmark encompassing 800 websites and 1,550 tasks across diverse domains, including consumer, professional, and enterprise sectors, with comprehensive coverage of user intent patterns. We propose NavEval (Navigation Evaluation), a novel LLM-as-Judge framework that leverages rich interaction context beyond visual screenshots, achieving state-of-the-art alignment with human judgment across multiple evaluation datasets. Furthermore, we establish three complementary evaluation protocols that collectively provide holistic assessment of web agent capabilities: navigation proficiency, knowledge-assisted interaction, and end-to-end task completion with information extraction. Extensive experimental analysis reveals substantial performance disparities across evaluation protocols, demonstrating that navigation success alone serves as an insufficient predictor of real-world application effectiveness. WebRetriever delivers fine-grained diagnostic insights into agent capabilities and establishes a rigorous foundation for advancing web agent research and development.

⭐ Main Contributions

(1) A large-scale, comprehensive benchmark for realistic web agent evaluation. We curate 1,550 tasks across 800 real websites spanning diverse domains and user intents. Compared with prior benchmarks, WebRetriever provides unprecedented scale, diversity, and coverage, enabling more comprehensive and representative evaluation of web agents in realistic online environments.

(2) A general and high-precision automated evaluation method. We propose NavEval, an automated evaluation method that attains approximately 90% agreement with human judgment in large-scale experiments, thereby enabling practical and reliable assessment of web agent performance at scale and in real time.

(3) Comprehensive evaluation framework. We propose three complementary evaluation protocols to systematically assess web agents, explicitly disentangling navigation success from answer correctness and characterizing behavioral reliability under injected operational knowledge, thereby providing diagnostic signals missing from prior benchmarks.

📊 Data Statistics & Comparison

Compared with prior benchmarks, WebRetriever encompasses 1,550 cross-industry tasks and over 800 carefully curated, high-quality active websites. To capture the landscape of mainstream internet behavior, we use SimilarWeb traffic data as our baseline and target eight core industry sectors closely related to everyday life and work: Technology & Internet, Business & Finance, Education & Research, Entertainment & Travel, Life Services, Healthcare, Industrial Manufacturing, and Public Services. We select the top 30 sites by traffic within each sector, and additionally incorporate specialized vertical sites from authoritative internet research institutions (e.g., 199it) that maintain comprehensive data navigation repositories. This approach particularly strengthens the dataset's evaluation capabilities in business intelligence domains while ensuring a broad distribution of website types. At the task design level, a diverse team comprising industry experts, mid-level managers, senior data analysts, and university students contributes tasks for websites within their domains of expertise, grounded in authentic user intents. Building upon this foundation, we categorize task intents along two orthogonal dimensions: "general intents" and "professional intents". By combining these intent types with domain-specific "Fact Elements," we generate numerous tasks with distinct scenario characteristics.
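The per-sector selection step described above (top 30 sites by traffic within each sector) can be sketched as a grouped top-k. This is a minimal illustration only: the `(domain, sector, monthly_visits)` record layout, the function name, and the example domains are assumptions, not the SimilarWeb schema or the authors' tooling.

```python
from collections import defaultdict


def top_sites_per_sector(sites, k=30):
    """Return the k highest-traffic domains in each sector.

    `sites` is an iterable of (domain, sector, monthly_visits) tuples.
    This layout is a hypothetical stand-in for real traffic data.
    """
    by_sector = defaultdict(list)
    for domain, sector, visits in sites:
        by_sector[sector].append((visits, domain))
    return {
        # Sort by visits, highest first, then keep the first k domains.
        sector: [domain for _, domain in sorted(entries, reverse=True)[:k]]
        for sector, entries in by_sector.items()
    }


# Made-up records for illustration only.
example_sites = [
    ("a.example", "Healthcare", 10),
    ("b.example", "Healthcare", 30),
    ("c.example", "Healthcare", 20),
    ("d.example", "Education & Research", 5),
]
selected = top_sites_per_sector(example_sites, k=2)
```

The specialized vertical sites mentioned above would be merged into the result as a separate, manually curated list.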
Figure: WebRetriever data statistics. (a) Website Domain Distribution; (b) Website Type Distribution; (c) General Intent Distribution; (d) Professional Intent Distribution.

| Intent-Type | Benchmarks | Setting | Online | Interactive | Websites | Eval-Tasks |
|---|---|---|---|---|---|---|
| General | MiniWoB | Synthetic | ❌ | ✔️ | 100 | 100 |
| General | MiniWoB++ | Synthetic | ❌ | ✔️ | 100 | 100 |
| General | WebArena | Semi-real | ❌ | ✔️ | 4 | 812 |
| General | VisualWebArena | Semi-real | ❌ | ✔️ | 3 | 910 |
| General | WebLINX | Real | ❌ | ❌ | 155 | 1,368 |
| General | Mind2Web | Real | ❌ | ❌ | 137 | 1,341 |
| General | Mind2Web-Live | Real | ✔️ | ✔️ | 46 | 104 |
| General | AssistantBench | Real | ✔️ | ✔️ | 258 | 214 |
| General | WebVoyager | Real | ✔️ | ✔️ | 15 | 643 |
| General | Bearcubs | Real | ✔️ | ✔️ | 108 | 111 |
| General | Online-Mind2Web | Real | ✔️ | ✔️ | 136 | 300 |
| Professional | WebShop | Semi-real | ❌ | ✔️ | 1 | 500 |
| Professional | ST-WebAgentBench | Semi-real | ❌ | ✔️ | 3 | 222 |
| Professional | Wonderbread | Semi-real | ❌ | ✔️ | 4 | 598 |
| Professional | WorkArena | Real | ✔️ | ✔️ | 5 | 33 |
| Professional | WorkArena++ | Real | ✔️ | ✔️ | 5 | 682 |
| Professional | OmniACT | Real | ✔️ | ✔️ | 27 | 736 |
| General & Professional | WebRetriever | Real | ✔️ | ✔️ | 800 | 1,550 |

(e) Comparison between WebRetriever and related benchmarks. Intent-Type: task intent type; Setting: the evaluation environment configuration; Online: whether online live connection evaluation is supported; Interactive: whether the environment allows interaction.
**Note: Reported statistics are based solely on the web-related evaluation subsets used for benchmarking. For datasets that additionally contain training splits or non-web CUA tasks, those portions are excluded from the comparison.**

🧠 NavEval

Current LLM-based evaluation pipelines exhibit limitations in organizing and utilizing evaluation signals. Existing methods employ three screenshot strategies: (1) full-trajectory screenshots, which provide comprehensive coverage but suffer from redundancy and high computational costs; (2) final screenshot only, which captures task completion status but misses critical intermediate process information; and (3) LLM-selected key screenshots, which reduce redundancy but lack sufficient accuracy in key step selection and matching. These limitations reveal the core challenge of automated evaluation: how to extract critical information to enhance LLM judgment accuracy. To address these limitations, we propose NavEval, which integrates more diverse information from agent-browser interactions. Beyond actions and final screenshots, NavEval leverages all navigation URLs and network requests generated during interactions. Through rule-based filtering to eliminate substantial noise, these signals are restructured into organized information that enhances LLM judgment accuracy.

Figure: Workflow of NavEval. Compared to existing methods, NavEval applies rule-based filtering to extract fine-grained intermediate signals, which are then jointly reasoned with the task description, action trajectory, and final screenshot by an LLM to determine task success, enabling robust evaluation with higher human agreement rates.
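To make the signal-restructuring idea concrete, here is a minimal sketch of rule-based filtering followed by context assembly for the judge. The specific rules (dropping static assets and tracker-like hosts, de-duplicating requests), the `NetworkRequest` type, and all function names are hypothetical; the source does not enumerate NavEval's actual filters or prompt layout.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical noise rules; NavEval's real filters are not listed in the source.
STATIC_SUFFIXES = (".js", ".css", ".png", ".jpg", ".svg", ".woff2", ".ico")
NOISE_HOST_KEYWORDS = ("analytics", "doubleclick", "tracking")


@dataclass
class NetworkRequest:
    method: str
    url: str


def is_noise(request: NetworkRequest) -> bool:
    """Return True for requests unlikely to reflect the user's intent."""
    parsed = urlparse(request.url)
    if parsed.path.lower().endswith(STATIC_SUFFIXES):
        return True
    host = parsed.netloc.lower()
    return any(keyword in host for keyword in NOISE_HOST_KEYWORDS)


def filter_requests(requests):
    """Keep non-noise requests, dropping exact duplicates and preserving order."""
    seen = set()
    kept = []
    for request in requests:
        key = (request.method, request.url)
        if is_noise(request) or key in seen:
            continue
        seen.add(key)
        kept.append(request)
    return kept


def build_judge_context(task, actions, visited_urls, requests):
    """Assemble the text part of the judge input.

    The final screenshot would be passed to the LLM separately as an image.
    """
    lines = [f"Task: {task}", "Actions:"]
    lines += [f"  {i}. {action}" for i, action in enumerate(actions, 1)]
    lines.append("Visited URLs:")
    lines += [f"  {url}" for url in visited_urls]
    lines.append("Key network requests:")
    lines += [f"  {r.method} {r.url}" for r in filter_requests(requests)]
    return "\n".join(lines)
```

A surviving request such as `GET https://example.com/api/search?q=laptop&price_max=500` exposes the query and filter the agent actually applied, which is the kind of fine-grained detail a final screenshot alone may not show.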

📋 Evaluation Protocols

To comprehensively evaluate web agent capabilities, we design three protocols reflecting realistic deployment scenarios. Protocol I assesses fundamental navigation ability by measuring whether agents can reach target pages given user tasks. Protocol II evaluates navigation performance when agents are provided with operational documentation.

For Protocol II, operational knowledge is consolidated into structured documentation through a closed-loop framework. Web agents first generate action trajectories for target tasks, which NavEval evaluates automatically. Correct trajectories proceed to annotation, while incorrect ones are regenerated until the retry limit is reached, after which human annotators refine them. The annotation platform further verifies accepted trajectories and streamlines redundant operations. Finally, LLMs generate operation manuals from these refined trajectories, completing the documentation pipeline.

While Protocols I and II focus on navigation, realistic web environments require capabilities beyond page navigation. Agents must demonstrate deep page understanding, multi-source information integration, and precise extraction, all of which are critical for holistic evaluation. Protocol III mirrors realistic deployment by assessing whether agents can accurately retrieve target information across multiple modalities (text, documents, charts).

Based on these designs, we collect 1,000, 1,000, and 100 tasks for Protocols I, II, and III, respectively. Notably, Protocols I and II share 550 overlapping tasks, with the sole difference being the availability of operational documentation.

Figure: Workflow of the semi-automated pipeline for constructing operational documentation in Protocol II. The process integrates automated exploration, evaluation, manual refinement, and LLM-based generation to produce high-quality operational documentation.
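The closed-loop pipeline for Protocol II can be sketched as a retry loop with a human fallback. This is an illustrative outline under stated assumptions: the callables `run_agent`, `judge`, `human_refine`, and `write_manual` are hypothetical placeholders for the agent, NavEval, the annotation platform, and the LLM manual writer, and the default retry limit of 3 is invented, since the source does not state it.

```python
from typing import Callable, List, Optional

Trajectory = List[str]  # an action trajectory as a list of step descriptions


def collect_trajectory(
    task: str,
    run_agent: Callable[[str], Trajectory],
    judge: Callable[[str, Trajectory], bool],
    human_refine: Callable[[str, Optional[Trajectory]], Trajectory],
    max_retries: int = 3,  # assumed value, not taken from the source
) -> Trajectory:
    """Regenerate until the judge accepts; otherwise hand the last attempt to a human."""
    last: Optional[Trajectory] = None
    for _ in range(max_retries):
        last = run_agent(task)
        if judge(task, last):
            return last
    return human_refine(task, last)


def streamline(trajectory: Trajectory) -> Trajectory:
    """Drop immediately repeated steps, a simple stand-in for removing redundant operations."""
    cleaned: Trajectory = []
    for step in trajectory:
        if not cleaned or cleaned[-1] != step:
            cleaned.append(step)
    return cleaned


def build_manual(task, run_agent, judge, human_refine, write_manual, max_retries=3):
    """Run the full loop: collect, streamline, then generate the operation manual."""
    trajectory = collect_trajectory(task, run_agent, judge, human_refine, max_retries)
    return write_manual(task, streamline(trajectory))
```

Passing the components in as callables keeps the loop independent of any particular agent or judge, so the same outline works with stubs during testing.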

πŸ† Leaderboard

Important Notice:

Results not listed on this leaderboard have not been verified by the WebRetriever Team, and their accuracy is not guaranteed.

Evaluation Protocols:

(1) Protocol I: Evaluates basic navigation ability to reach target pages.

(2) Protocol II: Assesses navigation performance when provided with operational knowledge.

(3) Protocol III: Measures end-to-end task completion by jointly evaluating navigation and information extraction, avoiding the limitation of equating page arrival with task success.

For Protocols I and II, we maintain both human-verified success rates and NavEval automated success rates. For Protocol III, since it evaluates end-to-end task completion (including information extraction), only human-verified success rates are reported.
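As a rough illustration of how the two kinds of success rate relate, the sketch below computes a success rate per label source and the fraction of tasks on which the automated verdict matches the human one. The source does not define its agreement metric, so simple per-task matching is an assumption here, and the labels are made-up toy data rather than leaderboard results.

```python
def success_rate(labels):
    """Fraction of tasks judged successful."""
    return sum(labels) / len(labels)


def agreement_rate(human, auto):
    """Fraction of tasks where the automated verdict matches the human verdict."""
    if len(human) != len(auto):
        raise ValueError("label lists must cover the same tasks")
    return sum(h == a for h, a in zip(human, auto)) / len(human)


# Toy verdicts for five hypothetical tasks (True = task judged successful).
human_labels = [True, True, False, False, True]
auto_labels = [True, False, False, False, True]

human_sr = success_rate(human_labels)
auto_sr = success_rate(auto_labels)
agreement = agreement_rate(human_labels, auto_labels)
```

On this toy data the human-verified success rate is 3/5, the automated success rate is 2/5, and the two sources agree on 4 of 5 tasks. Two judges can report similar success rates while disagreeing on individual tasks, which is why agreement is measured per task.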

📌 Note: The Protocol III evaluation set has been expanded from the original 50 tasks (as reported in the paper) to 100 tasks for more comprehensive assessment. As a result, Protocol III scores on this leaderboard may differ from those reported in the paper.

Auto Evaluation:

In our implementation, NavEval's judgment module adopts Claude-4.5-Sonnet to ensure stable semantic understanding and reasoning.


Evaluation Guideline

To facilitate transparent and reproducible evaluation, we will publicly release the complete evaluation framework, including implementations of all three evaluation protocols, the NavEval automated evaluation pipeline, and human evaluation guidelines.

BibTeX

@article{webretriever-2026,
      title={WebRetriever: A Large-Scale Comprehensive Benchmark for Efficient Web Agent Evaluation},
      author={Wei Dong and Tianyu Fu and Zhe Yu and Hanning Wang and Anyang Su and Zhizhou Fang and Yuyang Chen and Shuo Wang and Minghui Wu and Ping Jiang and Zhen Lei and Chenxu Zhao},
      year={2026},
      note={To be published},
      url={https://github.com/hhhhhhalf/WebRetriever}
}