VibeBench/VibeSearchBench

A search benchmark where the best reported score is 30.3 triplet F1

VibeSearchBench tests whether agents can handle vague, evolving queries the way real humans actually ask them.

827 stars · Python · LLMOps · EvalAgents

What it does

VibeSearchBench is a 200-task benchmark for multi-turn search agents. Each task starts with a vague query and a hidden ground-truth knowledge graph; a persona-driven user simulator progressively discloses more intent as the agent asks follow-ups or returns partial results. The agent can search the web, visit pages, and run Python across many turns. Scoring uses triplet F1 against the ground-truth KG, judged by an LLM evaluating semantic equivalence of entities and relations.
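To make the scoring concrete, here is a minimal sketch of triplet F1 over (head, relation, tail) triplets. It is not the benchmark's implementation: the real evaluation uses an LLM judge to decide semantic equivalence, whereas this sketch swaps in a case-insensitive exact match. The names `triplet_f1`, `exact_match`, and `is_equivalent` are hypothetical and chosen for illustration only.

```python
from typing import Callable, Iterable, Tuple

Triplet = Tuple[str, str, str]  # (head entity, relation, tail entity)


def normalize(triplet: Triplet) -> Triplet:
    """Lowercase and strip each part so trivial formatting differences match."""
    return tuple(part.strip().lower() for part in triplet)


def exact_match(a: Triplet, b: Triplet) -> bool:
    """Stand-in for the LLM judge (assumption): normalized string equality."""
    return normalize(a) == normalize(b)


def triplet_f1(
    predicted: Iterable[Triplet],
    gold: Iterable[Triplet],
    is_equivalent: Callable[[Triplet, Triplet], bool] = exact_match,
) -> float:
    """Harmonic mean of precision and recall over one-to-one matched triplets."""
    predicted = list(predicted)
    unmatched_gold = list(gold)
    gold_count = len(unmatched_gold)
    if not predicted or gold_count == 0:
        return 0.0

    true_positives = 0
    for candidate in predicted:
        # Each gold triplet can be matched at most once.
        for index, reference in enumerate(unmatched_gold):
            if is_equivalent(candidate, reference):
                true_positives += 1
                del unmatched_gold[index]
                break

    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / gold_count
    return 2 * precision * recall / (precision + recall)


gold = [
    ("Paper A", "cites", "Paper B"),
    ("Paper A", "published_in", "2023"),
    ("Paper B", "authored_by", "J. Doe"),
]
predicted = [
    ("paper a", "cites", "paper b"),       # matches after normalization
    ("Paper A", "published_in", "2021"),   # wrong tail, no match
]
score = triplet_f1(predicted, gold)
```

In this example one of two predictions is correct (precision 0.5) and one of three gold triplets is recovered (recall 1/3), giving an F1 of 0.4. Passing a different `is_equivalent` callable is where a semantic judge would plug in.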

The interesting bit

The best reported score is 30.3 triplet F1. That is not a typo. The benchmark is deliberately adversarial: real users do not hand you a spec sheet, and the evaluation is schema-free, with no predetermined ontology, just whether the extracted knowledge graph matches the ground truth. Progressive disclosure means the agent cannot succeed with one-shot retrieval: the agent and the simulated user have to converge on the real intent over several turns, each reacting to what the other has said so far.
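A toy illustration of progressive disclosure, assuming the simplest possible policy: the simulated user reveals one hidden detail each time the agent asks a follow-up. The benchmark's simulator is persona-driven and LLM-backed, so this is only a sketch of the interaction shape; `ToyUserSimulator` and `run_dialogue` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyUserSimulator:
    """Reveals hidden intent one detail per agent follow-up (assumed policy)."""

    initial_query: str
    hidden_details: List[str]
    next_index: int = 0

    @property
    def exhausted(self) -> bool:
        return self.next_index >= len(self.hidden_details)

    def respond(self, agent_message: str) -> str:
        # The toy ignores the message content; a real simulator would
        # condition its reply on what the agent asked or returned.
        if self.exhausted:
            return "That covers everything I had in mind."
        detail = self.hidden_details[self.next_index]
        self.next_index += 1
        return detail


def run_dialogue(simulator: ToyUserSimulator, max_turns: int) -> List[str]:
    """Collect what the agent knows after asking up to max_turns follow-ups."""
    known = [simulator.initial_query]
    for turn in range(max_turns):
        if simulator.exhausted:
            break
        known.append(simulator.respond(f"Could you tell me more? (turn {turn + 1})"))
    return known


simulator = ToyUserSimulator(
    initial_query="I need a laptop",
    hidden_details=["It should be under 1.5 kg", "Budget is around $1200"],
)
known_intent = run_dialogue(simulator, max_turns=5)
```

An agent that answers after the first message only ever sees the vague query; the full intent is available only to an agent that keeps the dialogue going.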

Key highlights

  • 200 tasks split evenly between professional research (literature reviews, due diligence) and daily-life search (shopping, travel with evolving preferences)
  • Two reference agent implementations: GeneralAgent (OpenAI-compatible LLM with optional multi-agent mode) and OpenClaw wrapper (CLI-based agent)
  • Evaluation via LLM-as-judge: node alignment, then triplet semantic equivalence, with avg@N and best@N aggregation
  • Full pipeline scripts for inference + evaluation, or either separately
  • Dataset and paper available on Hugging Face
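The avg@N and best@N aggregation can be sketched as follows, assuming each task is run N times and yields one F1 score per run: avg@N takes the mean over a task's runs, best@N takes the maximum, and both are then averaged across tasks. This reading of the metric names is an assumption, and `avg_at_n` and `best_at_n` are hypothetical helper names, not functions from the repository.

```python
from statistics import mean
from typing import Dict, List


def avg_at_n(scores_per_task: Dict[str, List[float]]) -> float:
    """Mean over tasks of the mean score across each task's N runs."""
    return mean(mean(runs) for runs in scores_per_task.values())


def best_at_n(scores_per_task: Dict[str, List[float]]) -> float:
    """Mean over tasks of the best score across each task's N runs."""
    return mean(max(runs) for runs in scores_per_task.values())


# Two tasks, two runs each (made-up scores, for illustration only).
example_scores = {
    "task_001": [0.2, 0.4],
    "task_002": [0.0, 0.6],
}
average_score = avg_at_n(example_scores)
best_score = best_at_n(example_scores)
```

Under these assumptions best@N is never lower than avg@N, and the gap between the two gives a rough sense of run-to-run variance.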

Caveats

  • Requires multiple external services: Serper for search, a vLLM endpoint for the main model, another for summarization, and a Gemini endpoint for grading (or OpenAI-compatible alternatives)
  • The 30.3 F1 top score could mean the benchmark is extremely hard, current agents are inadequate, or both; the README does not analyze which
  • “OpenClaw” is referenced without any explanation of what it is; it appears to be an external CLI tool you must already have running

Verdict

Worth studying if you are building search agents, evaluating LLM tool use, or skeptical that current systems handle real-world vague queries. Skip if you want a plug-and-play leaderboard entry: the setup is nontrivial and the best reported score is humbling.
