Ayanami0730/deep_research_bench
A benchmark suite for evaluating AI agents on deep research tasks with automatic scoring and a public leaderboard.

Velocity (7d): +2.1 ★/day · Trend: steady
DeepResearch Bench is an evaluation framework for AI agents that perform deep research. It provides three components: a dataset of research tasks with human-annotated reference answers; an automatic evaluator that uses frontier models (GPT-5.5, GPT-5.4-mini, Gemini-2.5-Pro) to score agent outputs on dimensions such as Overall quality, PAR, and FAS; and a public leaderboard, hosted on Hugging Face, for comparing agent performance.
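As a rough illustration of how per-task scores on several dimensions might be rolled up into leaderboard rows, here is a minimal sketch. It is not the project's actual code: the record layout, the field names (`overall`, `par`, `fas`), the sample numbers, and the use of a plain mean per dimension are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task evaluator output; every value here is made up.
scores = [
    {"agent": "agent_a", "task": 1, "overall": 40.0, "par": 30.0, "fas": 50.0},
    {"agent": "agent_a", "task": 2, "overall": 50.0, "par": 40.0, "fas": 60.0},
    {"agent": "agent_b", "task": 1, "overall": 60.0, "par": 50.0, "fas": 40.0},
    {"agent": "agent_b", "task": 2, "overall": 50.0, "par": 30.0, "fas": 20.0},
]


def build_leaderboard(rows, dimensions=("overall", "par", "fas"), sort_by="overall"):
    """Average each dimension per agent, then rank agents by `sort_by`.

    Assumption: a simple unweighted mean over tasks; the real benchmark
    may aggregate differently.
    """
    # Group the per-task records by agent name.
    by_agent = defaultdict(list)
    for row in rows:
        by_agent[row["agent"]].append(row)

    # Build one leaderboard entry per agent with a mean per dimension.
    board = []
    for agent, agent_rows in by_agent.items():
        entry = {"agent": agent}
        for dim in dimensions:
            entry[dim] = mean(r[dim] for r in agent_rows)
        board.append(entry)

    # Highest score on the chosen dimension comes first.
    board.sort(key=lambda e: e[sort_by], reverse=True)
    return board


if __name__ == "__main__":
    for rank, entry in enumerate(build_leaderboard(scores), start=1):
        print(rank, entry)
```

Keeping the dimensions as a parameter means the same helper can rank by any single dimension, for example `build_leaderboard(scores, sort_by="fas")`.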