Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. Uses 5-layer recording, DOM-match, and an LLM judge. Top score: 33.3%.
chrome-extension benchmark evaluation dataset browser-automation ai-agents web-agent web-agents everyday-tasks browser-agent llm llm-evaluation agentic-ai computer-use browser-use agent-evaluation ai-agent-benchmark online-tasks chrome-agent real-world-benchmark
Updated May 5, 2026 - Python