WebRetriever: A Large-Scale Comprehensive Benchmark for Efficient Web Agent Evaluation

💡 Motivation

Existing web agent evaluation suffers from three key limitations: (1) insufficient benchmark coverage — offline benchmarks lack real-world fidelity, while online benchmarks remain limited in website scale, domain diversity, and intent variety, leading to biased and overly optimistic assessments; (2) unscalable evaluation methods — human annotation is costly and impractical at scale, and current automated approaches lack sufficient accuracy on complex queries; (3) narrow evaluation dimensions — existing protocols fail to systematically assess agents' ability to leverage external knowledge and execute end-to-end workflows required for real-world deployment. WebRetriever addresses all three gaps through large-scale diverse benchmarking, a reliable automated evaluation method, and deployment-oriented evaluation protocols.

Motivation for the WebRetriever benchmark. WebRetriever addresses key limitations of prior work from three aspects: dataset scale and diversity, automated evaluation reliability, and deployment-oriented evaluation protocols.

📄 Abstract

As web agents increasingly demonstrate capabilities in automated task execution, the development of robust evaluation frameworks for assessing their navigation and task completion performance has emerged as a critical research priority. However, existing benchmarks exhibit several fundamental limitations. First, they suffer from insufficient scale and limited domain diversity, thereby constraining comprehensive evaluation of cross-domain generalization. Second, prevailing LLM-as-Judge evaluation methodologies inadequately capture fine-grained interaction semantics, particularly regarding precise query formulation and filtering operations. Third, current benchmarks predominantly emphasize navigation success metrics while neglecting critical requirements for real-world deployment scenarios. To address these limitations, we introduce WebRetriever, a large-scale benchmark encompassing 800 websites and 1,550 tasks across diverse domains, including consumer, professional, and enterprise sectors, with comprehensive coverage of user intent patterns. We propose NavEval (Navigation Evaluation), a novel LLM-as-Judge framework that leverages rich interaction context beyond visual screenshots, achieving state-of-the-art alignment with human judgment across multiple evaluation datasets. Furthermore, we establish three complementary evaluation protocols that collectively provide holistic assessment of web agent capabilities: navigation proficiency, knowledge-assisted interaction, and end-to-end task completion with information extraction. Extensive experimental analysis reveals substantial performance disparities across evaluation protocols, demonstrating that navigation success alone serves as an insufficient predictor of real-world application effectiveness. WebRetriever delivers fine-grained diagnostic insights into agent capabilities and establishes a rigorous foundation for advancing web agent research and development.

⭐ Main Contributions

(1) A large-scale, comprehensive benchmark for realistic web agent evaluation. We curate 1,550 tasks across 800 real websites spanning diverse domains and user intents. Compared with prior benchmarks, WebRetriever provides unprecedented scale, diversity, and coverage, enabling more comprehensive and representative evaluation of web agents in realistic online environments.

(2) A general and high-precision automated evaluation method. We propose NavEval, an automated evaluation method that attains approximately 90% agreement with human judgments in large-scale experiments, thereby enabling practical and reliable assessment of web agent performance at scale and in real time.

(3) Comprehensive evaluation framework. We propose three complementary evaluation protocols to systematically assess web agents, explicitly disentangling navigation success from answer correctness and characterizing behavioral reliability under injected operational knowledge, thereby providing diagnostic signals missing from prior benchmarks.

📊 Data Statistics & Comparison

Compared with prior benchmarks, WebRetriever encompasses 1,550 cross-industry tasks and over 800 carefully curated, high-quality active websites. To capture the landscape of mainstream internet behavior, we use SimilarWeb traffic data as our baseline and target eight core industry sectors closely related to everyday life and work: Technology & Internet, Business & Finance, Education & Research, Entertainment & Travel, Life Services, Healthcare, Industrial Manufacturing, and Public Services. We select the top 30 sites by traffic within each sector, and additionally incorporate specialized vertical sites from authoritative internet research institutions (e.g., 199it) that maintain comprehensive data navigation repositories. This approach particularly strengthens the dataset's evaluation capabilities in business intelligence domains while ensuring a broad distribution of website types. At the task design level, a diverse team comprising industry experts, mid-level managers, senior data analysts, and university students contributes tasks for websites within their domains of expertise, grounded in authentic user intents. Building upon this foundation, we categorize task intents into two orthogonal dimensions: "general intents" and "professional intents". By combining these intent types with domain-specific "Fact Elements," we generate numerous tasks with distinct scenario characteristics.
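
The per-sector selection step described above can be illustrated with a short sketch. This is not the authors' curation code: the record layout `(domain, sector, monthly_visits)`, the function name `top_sites_per_sector`, and the tie-breaking rule are all assumptions made for illustration, and the example domains are placeholders.

```python
from collections import defaultdict


def top_sites_per_sector(sites, k=30):
    """Return the k highest-traffic domains for each sector.

    `sites` is an iterable of (domain, sector, monthly_visits) tuples.
    This record layout is an assumption made for the sketch.
    """
    by_sector = defaultdict(list)
    for domain, sector, visits in sites:
        by_sector[sector].append((visits, domain))

    selected = {}
    for sector, entries in by_sector.items():
        # Sort by traffic, highest first; break ties by domain name so the
        # result is deterministic.
        entries.sort(key=lambda entry: (-entry[0], entry[1]))
        selected[sector] = [domain for _, domain in entries[:k]]
    return selected


# Placeholder data, not taken from the benchmark.
example_sites = [
    ("a.example", "Healthcare", 100),
    ("b.example", "Healthcare", 300),
    ("c.example", "Healthcare", 200),
    ("d.example", "Public Services", 50),
]
print(top_sites_per_sector(example_sites, k=2))
```

With `k=30` and eight sectors, a selection like this yields at most 240 sites; the remaining sites in the benchmark come from the specialized vertical sources mentioned above.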

(a) Website Domain Distribution

(b) Website Type Distribution

(c) General Intent Distribution

(d) Professional Intent Distribution

Intent-Type	Benchmarks	Setting	Online	Interactive	Websites	Eval-Tasks
General	MiniWoB	Synthetic	❌	✔️	100	100
	MiniWoB++	Synthetic	❌	✔️	100	100
	WebArena	Semi-real	❌	✔️	4	812
	VisualWebArena	Semi-real	❌	✔️	3	910
	WebLINX	Real	❌	❌	155	1,368
	Mind2Web	Real	❌	❌	137	1,341
	Mind2Web-Live	Real	✔️	✔️	46	104
	AssistantBench	Real	✔️	✔️	258	214
	WebVoyager	Real	✔️	✔️	15	643
	Bearcubs	Real	✔️	✔️	108	111
	Online-Mind2Web	Real	✔️	✔️	136	300
Professional	WebShop	Semi-real	❌	✔️	1	500
	ST-WebAgentBench	Semi-real	❌	✔️	3	222
	Wonderbread	Semi-real	❌	✔️	4	598
	WorkArena	Real	✔️	✔️	5	33
	WorkArena++	Real	✔️	✔️	5	682
	OmniACT	Real	✔️	✔️	27	736
General & Professional	WebRetriever	Real	✔️	✔️	800	1,550

(e) Comparison between WebRetriever and related benchmarks. Intent-Type: the type of task intent covered; Setting: the kind of evaluation environment (synthetic, semi-real, or real); Online: whether evaluation against live websites is supported; Interactive: whether the environment allows interaction.

**Note: Reported statistics are based solely on the web-related evaluation subsets used for benchmarking. For datasets that additionally contain training splits or non-web CUA tasks, those portions are excluded from the comparison.**

🧠 NavEval

Current LLM-based evaluation pipelines exhibit limitations in organizing and utilizing evaluation signals. Existing methods employ three screenshot strategies: (1) full-trajectory screenshots, which provide comprehensive coverage but suffer from redundancy and high computational costs; (2) final screenshot only, which captures task completion status but misses critical intermediate process information; and (3) LLM-selected key screenshots, which reduce redundancy but lack sufficient accuracy in key step selection and matching. These limitations reveal the core challenge of automated evaluation: how to extract the critical information that improves LLM judgment accuracy. To address this challenge, we propose NavEval, which integrates more diverse information from agent-browser interactions. Beyond actions and final screenshots, NavEval leverages all navigation URLs and network requests generated during interactions. Rule-based filtering first removes the substantial noise in these signals, which are then restructured into organized information that enhances LLM judgment accuracy.
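
To make the idea of rule-based filtering concrete, here is a minimal sketch of how raw request URLs might be reduced to structured signals. The actual NavEval rules are not given in this document; the suffix list, the host keywords, and the names `is_noise` and `summarize_requests` are hypothetical and chosen only for illustration.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical noise rules; the real NavEval rules are not specified here.
STATIC_SUFFIXES = (".js", ".css", ".png", ".jpg", ".jpeg", ".gif",
                   ".svg", ".ico", ".woff", ".woff2")
NOISE_HOST_KEYWORDS = ("analytics", "tracker")


def is_noise(url):
    """Return True for static assets and telemetry-style hosts."""
    parts = urlsplit(url)
    if parts.path.lower().endswith(STATIC_SUFFIXES):
        return True
    host = parts.netloc.lower()
    return any(keyword in host for keyword in NOISE_HOST_KEYWORDS)


def summarize_requests(urls):
    """Drop noisy and duplicate URLs, then expose host, path, and query parameters."""
    seen = set()
    summary = []
    for url in urls:
        if is_noise(url) or url in seen:
            continue
        seen.add(url)
        parts = urlsplit(url)
        summary.append({
            "host": parts.netloc,
            "path": parts.path,
            "params": dict(parse_qsl(parts.query)),
        })
    return summary


# Placeholder trajectory, not taken from the benchmark.
requests_seen = [
    "https://shop.example.com/search?q=laptop&price_max=1000",
    "https://shop.example.com/static/app.js",
    "https://analytics.example.net/collect?id=1",
]
print(summarize_requests(requests_seen))
```

Exposing query parameters in this way is one plausible route to making search terms and filter settings visible to the judge, which a screenshot alone may not show.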

Workflow of NavEval. Compared to existing methods, NavEval applies rule-based filtering to extract fine-grained intermediate signals, which are then jointly reasoned with the task description, action trajectory, and final screenshot by an LLM to determine task success, enabling robust evaluation with higher human agreement rates.
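
The human agreement rate mentioned above can be computed in several ways, and this document does not define the exact metric. The sketch below assumes the simplest reading: the fraction of tasks on which the automated judge and the human annotator assign the same success label. The function name `agreement_rate` is hypothetical.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of tasks where the judge's label matches the human label.

    Assumes one boolean success label per task from each source; this
    definition is an assumption, not taken from the paper.
    """
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("at least one labeled task is required")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Placeholder labels: the judge disagrees with the human on one of four tasks.
print(agreement_rate([True, False, True, True], [True, False, False, True]))
```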

📋 Evaluation Protocols

To comprehensively evaluate web agent capabilities, we design three protocols reflecting realistic deployment scenarios. Protocol I assesses fundamental navigation ability by measuring whether agents can reach target pages given user tasks. Protocol II evaluates navigation performance when agents are provided with operational documentation. For Protocol II, operational knowledge is consolidated into structured documentation through a closed-loop framework. Web agents first generate action trajectories for target tasks, which NavEval evaluates automatically. Correct trajectories proceed to annotation, while incorrect ones are regenerated until the retry limit is reached, after which human annotators refine them. The annotation platform further verifies accepted trajectories and streamlines redundant operations. Finally, LLMs generate operation manuals from these refined trajectories, completing the documentation pipeline. While Protocols I and II focus on navigation, realistic web environments require capabilities beyond page navigation: agents must demonstrate deep page understanding, multi-source information integration, and precise extraction, all of which are critical for holistic evaluation. Protocol III mirrors realistic deployment by assessing whether agents can accurately retrieve target information across multiple modalities (text, documents, charts).

Based on these designs, we collect 1,000, 1,000, and 100 tasks for Protocols I, II, and III, respectively. Notably, Protocols I and II share 550 overlapping tasks, which differ only in whether operational documentation is available.
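
As a quick consistency check, the protocol sizes can be reconciled with the 1,550-task total by inclusion-exclusion. The sketch below assumes that the Protocol III tasks are disjoint from the Protocol I and II tasks, which is not stated explicitly in the text; the function name is hypothetical.

```python
def total_unique_tasks(protocol_i, protocol_ii, shared_i_ii, protocol_iii):
    """Count unique tasks, assuming Protocol III shares no tasks with I or II."""
    unique_navigation_tasks = protocol_i + protocol_ii - shared_i_ii
    return unique_navigation_tasks + protocol_iii


# 1,000 + 1,000 - 550 + 100
print(total_unique_tasks(1000, 1000, 550, 100))
```

Under that assumption the result matches the 1,550 tasks reported for the benchmark.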

Workflow of the semi-automated pipeline for constructing operational documentation in Protocol II. The process integrates automated exploration, evaluation, manual refinement, and LLM-based generation to produce high-quality operational documentation.
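
The closed loop described for Protocol II can be sketched as a small control-flow function. This is only an outline under stated assumptions: every callable (`generate`, `judge`, `refine_by_human`, `write_manual`), the default `max_retries=3`, and the helper `drop_repeated_steps` are hypothetical stand-ins for the agent, NavEval, the annotation platform, and the manual-writing LLM, none of whose interfaces are specified in this document.

```python
def drop_repeated_steps(trajectory):
    """Stand-in for streamlining: remove consecutive duplicate steps."""
    cleaned = []
    for step in trajectory:
        if not cleaned or cleaned[-1] != step:
            cleaned.append(step)
    return cleaned


def build_manual(task, generate, judge, refine_by_human, write_manual,
                 max_retries=3):
    """Generate, judge, retry, fall back to human refinement, then write a manual."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    accepted = None
    candidate = None
    for _ in range(max_retries):
        candidate = generate(task)      # agent proposes a trajectory
        if judge(task, candidate):      # automated evaluation (NavEval's role)
            accepted = candidate
            break

    if accepted is None:
        # Retry limit reached: a human annotator repairs the last attempt.
        accepted = refine_by_human(task, candidate)

    streamlined = drop_repeated_steps(accepted)
    return write_manual(task, streamlined)


# Placeholder callables that simulate one failed and one successful attempt.
attempts = iter([["click", "click"], ["open", "open", "search"]])
manual = build_manual(
    "find laptop",
    generate=lambda task: next(attempts),
    judge=lambda task, traj: "search" in traj,
    refine_by_human=lambda task, traj: traj + ["fixed"],
    write_manual=lambda task, traj: task + ": " + " -> ".join(traj),
)
print(manual)
```

Passing the components in as callables keeps the sketch independent of any particular agent or judge implementation.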

Evaluation Guideline

To facilitate transparent and reproducible evaluation, we will publicly release the complete evaluation framework, including implementations of all three evaluation protocols, the NavEval automated evaluation pipeline, and human evaluation guidelines.

BibTeX

@article{webretriever-2026,
      title={WebRetriever: A Large-Scale Comprehensive Benchmark for Efficient Web Agent Evaluation},
      author={Wei Dong and Tianyu Fu and Zhe Yu and Hanning Wang and Anyang Su and Zhizhou Fang and Yuyang Chen and Shuo Wang and Minghui Wu and Ping Jiang and Zhen Lei and Chenxu Zhao},
      year={2026},
      note={To be published},
      url={https://github.com/hhhhhhalf/WebRetriever}
}
