This research addresses the gap between strong LLM performance on academic web agent benchmarks and limited real-world adoption for structured data extraction tasks. The authors introduce WebLists — a benchmark featuring 200 tasks across four business use cases — and propose BardeenAgent, which converts agent actions into repeatable programs.
Abstract
Existing agents achieve only 3–31% recall on real-world structured extraction tasks. BardeenAgent, which converts agent actions into repeatable programs and replays them across pages with similar structure, achieves 66% recall while reducing costs threefold. The benchmark spans 50 major cloud companies and evaluates deterministically using website-specific scripts rather than LLM judges.
1. Introduction
The paper identifies two limitations in existing benchmarks: they focus on navigation rather than structured data extraction, and involve few websites (4–15 compared to WebLists' 50). The authors propose "interactive schema-bound data extraction across websites," where agents must navigate, manipulate interactive elements, and extract structured data adhering to predefined schemas.
BardeenAgent operates in two phases: Record (the agent navigates and records generalizable CSS selectors while acting only on the first item of each list) and Replay (recorded actions are compiled into an executable program that loops over all list items).
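The Record/Replay idea can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the `RecordedAction` structure, `compile_replay` helper, and dict-based stand-in for DOM queries are all assumptions made for the example.

```python
# Hypothetical sketch: actions recorded on the FIRST list item are stored
# with generalizable CSS selectors, then compiled into a program that
# replays them in a loop over EVERY item matching the list selector.
from dataclasses import dataclass

@dataclass
class RecordedAction:
    kind: str         # e.g. "extract" (illustrative action vocabulary)
    selector: str     # CSS selector relative to one list item
    field: str = ""   # output column name for extracted values

def compile_replay(list_selector, actions):
    """Compile actions recorded on one item into a loop over all items.
    `query` is any callable mapping a selector to a list of item nodes."""
    def replay(query):
        rows = []
        for item in query(list_selector):   # generalize: all items, not just the first
            row = {}
            for act in actions:
                if act.kind == "extract":
                    row[act.field] = item.get(act.selector)
            rows.append(row)
        return rows
    return replay

# Toy stand-in for a DOM: each "item" maps sub-selectors to text content.
fake_items = [
    {".title": "Senior Engineer", ".location": "Remote"},
    {".title": "Data Analyst",    ".location": "NYC"},
]
program = compile_replay("li.job", [
    RecordedAction("extract", ".title", "title"),
    RecordedAction("extract", ".location", "location"),
])
print(program(lambda sel: fake_items))
# → [{'title': 'Senior Engineer', 'location': 'Remote'},
#    {'title': 'Data Analyst', 'location': 'NYC'}]
```

Because the compiled program is plain selector-driven code rather than per-item LLM calls, replaying it across structurally similar pages costs nothing extra in model inference, which is the source of the cost reduction claimed above.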
2. The WebLists Benchmark
The benchmark comprises 200 tasks across four use cases:
- Extracting job postings — 5,524 rows
- Extracting blog posts — 2,589 rows
- Extracting customer testimonials — 2,943 rows
- Extracting filtered job postings by category/location — 1,081 rows
Tasks span 50 major cloud companies, with deterministic evaluation using website-specific scripts rather than LLM judges — ensuring reproducibility and resistance to prompt gaming.
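The paper's evaluators are website-specific scripts, but the row-level scoring they enable can be illustrated generically. The `score` helper below is an assumption about what such deterministic recall/precision computation could look like, not the benchmark's actual code.

```python
# Illustrative deterministic scorer: compare extracted rows against gold
# rows as normalized sets, with no LLM judge involved. Identical inputs
# always produce identical scores, so results are reproducible.

def score(predicted, gold):
    """Row-level (recall, precision) over dicts of extracted fields."""
    pred = {tuple(sorted(r.items())) for r in predicted}  # hashable rows
    gt   = {tuple(sorted(r.items())) for r in gold}
    hits = len(pred & gt)                                 # exact-match rows
    recall    = hits / len(gt)   if gt   else 0.0
    precision = hits / len(pred) if pred else 0.0
    return recall, precision

gold = [{"title": "Senior Engineer"}, {"title": "Data Analyst"}]
pred = [{"title": "Senior Engineer"}, {"title": "Intern"}]
print(score(pred, gold))  # → (0.5, 0.5)
```

Because scoring is pure set arithmetic over extracted rows, there is no prompt for an agent to game, unlike LLM-judge evaluation.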
3. BardeenAgent Architecture
The recording phase converts temporary DOM IDs into robust CSS selectors. The EnterList tool establishes scoped extraction contexts with pagination controls. Multi-page list navigation is handled in two passes: the first extracts detail-page URLs, and the second visits each URL to extract detailed information.
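The two-pass strategy can be sketched with a stubbed page fetcher. The `PAGES` table, `fetch` stub, and URL shapes are invented for illustration; a real agent would drive a browser instead.

```python
# Hedged sketch of the two-pass extraction described above:
#   pass 1 walks the paginated list and collects detail-page URLs;
#   pass 2 visits each URL to extract the full structured record.

# Canned pages standing in for a live website (illustrative only).
PAGES = {
    "/jobs?page=1": {"urls": ["/job/1", "/job/2"], "next": "/jobs?page=2"},
    "/jobs?page=2": {"urls": ["/job/3"], "next": None},
    "/job/1": {"title": "Senior Engineer"},
    "/job/2": {"title": "Data Analyst"},
    "/job/3": {"title": "SRE"},
}

def fetch(url):
    """Stub for page retrieval; a real system would render the page."""
    return PAGES[url]

def two_pass(start_url):
    # Pass 1: collect detail URLs across all pagination pages.
    urls, page = [], start_url
    while page:
        data = fetch(page)
        urls.extend(data["urls"])
        page = data["next"]            # follow pagination until exhausted
    # Pass 2: extract structured details from each collected URL.
    return [fetch(u) for u in urls]

print(two_pass("/jobs?page=1"))  # → three job records
```

Separating URL collection from detail extraction keeps each pass simple and lets the detail pass be replayed independently if individual pages fail.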
Selector generation uses heuristic approaches that sample diverse CSS strategies — parent-child, position, classes, attributes — and filter non-matching candidates. LLM-based generation handles complex cases lacking clean common parents.
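The heuristic search over selector strategies might look like the sketch below. The node representation, strategy set, and `matches` oracle are assumptions for illustration; the paper's generator operates on a live DOM.

```python
# Illustrative heuristic selector search: propose candidates from several
# CSS strategies (classes, parent-child, position, attributes), then keep
# only those that still match the recorded example node.

def candidate_selectors(node):
    """Yield one candidate per strategy for `node`, a dict with keys
    'tag', 'classes', 'attrs', 'parent_tag', 'index' (illustrative)."""
    cands = []
    if node["classes"]:                                   # class-based
        cands.append(node["tag"] + "." + ".".join(node["classes"]))
    if node["parent_tag"]:                                # parent-child
        cands.append(f'{node["parent_tag"]} > {node["tag"]}')
    cands.append(f'{node["tag"]}:nth-child({node["index"]})')  # positional
    for key, val in node["attrs"].items():                # attribute-based
        cands.append(f'{node["tag"]}[{key}="{val}"]')
    return cands

def pick_selector(node, matches):
    """Filter candidates through a `matches(selector, node) -> bool`
    oracle (in practice, a DOM query) and return the first survivor."""
    return next((s for s in candidate_selectors(node)
                 if matches(s, node)), None)

node = {"tag": "a", "classes": ["job-title"], "attrs": {"data-id": "42"},
        "parent_tag": "li", "index": 1}
print(candidate_selectors(node))
```

An LLM would be consulted only when every heuristic candidate is filtered out, e.g. when sibling items share no clean common parent.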
4. Evaluation Results
Performance comparison shows BardeenAgent achieving 66.2% overall recall versus Agent-E's 12.1% and Wilbur's 30.5%. Precision is 72.5% for BardeenAgent. Cost analysis reveals BardeenAgent's amortized cost per extracted row is $1.07 versus $3.21–$4.55 for baselines.
"Existing agents achieve only 3–31% recall on structured extraction tasks — BardeenAgent's program-synthesis approach closes this gap while reducing cost threefold."
5. Conclusion
The authors establish WebLists as a rigorous benchmark for live website evaluation and position BardeenAgent as a significant advancement in executable LLM agents for data extraction. They note that real-world applications require even higher performance standards — motivating further research into generalizable program synthesis for web automation.