
Ayanami0730/deep_research_bench

A benchmark suite for evaluating AI agents on deep research tasks with automatic scoring and a public leaderboard.

747 stars Python LLMOps · EvalAgents
Velocity (7d): +2.1 ★ / day · Trend: steady

DeepResearch Bench is an evaluation framework for AI agents that perform deep research. The project provides three components: a dataset of research tasks with human-annotated reference answers; an automatic evaluator that uses frontier models (GPT-5.5, GPT-5.4-mini, Gemini-2.5-Pro) to score agent outputs on dimensions such as Overall quality, PAR, and FAS; and a public leaderboard hosted on Hugging Face for comparing agent performance.
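
As a rough illustration of how per-task scores could feed a leaderboard, the sketch below averages each scoring dimension per agent and ranks agents by one of them. It is not the repository's code: the record layout, the `build_leaderboard` function, the agent names, and all score values and scales are hypothetical, and only the dimension names "overall" and "par" are borrowed from the description above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task scores; agent names, values, and scales are made up.
# Each record is (agent, task_id, {dimension: score}).
records = [
    ("agent_a", 1, {"overall": 40.0, "par": 0.50}),
    ("agent_a", 2, {"overall": 44.0, "par": 0.70}),
    ("agent_b", 1, {"overall": 48.0, "par": 0.60}),
    ("agent_b", 2, {"overall": 46.0, "par": 0.40}),
]


def build_leaderboard(records, sort_by="overall"):
    """Average each dimension per agent, then rank agents by one dimension."""
    per_agent = defaultdict(lambda: defaultdict(list))
    for agent, _task_id, scores in records:
        for dim, value in scores.items():
            per_agent[agent][dim].append(value)

    # One row per agent, holding the mean of every dimension seen for it.
    rows = [
        {"agent": agent, **{dim: mean(vals) for dim, vals in dims.items()}}
        for agent, dims in per_agent.items()
    ]
    # Highest score first on the chosen dimension.
    return sorted(rows, key=lambda row: row[sort_by], reverse=True)


leaderboard = build_leaderboard(records)
for rank, row in enumerate(leaderboard, start=1):
    print(rank, row["agent"], row["overall"], row["par"])
```

Passing `sort_by="par"` would rank the same rows on a different dimension. A plain mean is used here only for simplicity; the actual benchmark may weight tasks or dimensions differently.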
