VibeBench/VibeSearchBench

A search benchmark where even the best LLM scores 30%

VibeSearchBench tests whether agents can handle vague, evolving queries the way real humans actually ask them.

★ 915 stars · Python · LLMOps · Eval · Agents

What it does

VibeSearchBench is a 200-task benchmark for multi-turn search agents. Each task starts with a vague query and a hidden ground-truth knowledge graph; a persona-driven user simulator progressively discloses more intent as the agent asks follow-ups or returns partial results. The agent can search the web, visit pages, and run Python across many turns. Scoring uses triplet F1 against the ground-truth KG, judged by an LLM evaluating semantic equivalence of entities and relations.
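To make the scoring idea concrete, here is a minimal sketch of triplet F1 over (head, relation, tail) tuples. It is an illustration only: the benchmark uses an LLM judge to decide semantic equivalence, whereas this sketch assumes a crude case-insensitive exact match. The function names and the example triplets are hypothetical and do not come from the repository.

```python
from typing import Iterable, Set, Tuple

Triplet = Tuple[str, str, str]  # (head entity, relation, tail entity)


def _normalize(triplets: Iterable[Triplet]) -> Set[Triplet]:
    """Lowercase and strip each field.

    ASSUMPTION: exact matching after normalization stands in for the
    LLM judge that the benchmark uses for semantic equivalence.
    """
    return {tuple(part.strip().lower() for part in t) for t in triplets}


def triplet_f1(predicted: Iterable[Triplet], gold: Iterable[Triplet]) -> float:
    """Harmonic mean of triplet precision and recall."""
    pred, ref = _normalize(predicted), _normalize(gold)
    matched = len(pred & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)  # share of predicted triplets that are correct
    recall = matched / len(ref)      # share of gold triplets that were recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground-truth graph and agent output for a travel query.
gold = [
    ("Kyoto", "has_attraction", "Fushimi Inari"),
    ("Kyoto", "best_season", "autumn"),
    ("Fushimi Inari", "entry_fee", "free"),
]
predicted = [
    ("kyoto", "has_attraction", "fushimi inari"),
    ("Kyoto", "best_season", "spring"),
]

score = triplet_f1(predicted, gold)
print(f"triplet F1 = {score:.2f}")
```

In this example one of two predicted triplets is correct (precision 0.5) and one of three gold triplets is recovered (recall 0.33), so the F1 is 0.40.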

The interesting bit

The best reported score is 30.3 triplet F1, and that is not a typo. The benchmark is deliberately adversarial: real users do not hand you a spec sheet, and the evaluation is schema-free, with no predetermined ontology, only a check of whether the extracted knowledge graph matches the ground truth. The progressive disclosure mechanism forces the agent and the simulated user to converge on the intent over several turns rather than relying on one-shot retrieval.
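The progressive disclosure idea can be sketched as a toy loop in which a simulated user starts with a vague query and reveals one more detail each time the agent comes back. This is a hedged illustration, not the benchmark's simulator: the real one is persona-driven and LLM-based, and every class, function, and string below is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyUserSimulator:
    """Hypothetical stand-in for a persona-driven user simulator."""
    initial_query: str
    hidden_details: List[str]  # intent the user has not stated yet
    revealed: int = 0

    def reply(self) -> Optional[str]:
        """Disclose the next hidden detail, or None when nothing is left."""
        if self.revealed >= len(self.hidden_details):
            return None
        detail = self.hidden_details[self.revealed]
        self.revealed += 1
        return detail


def run_dialogue(sim: ToyUserSimulator, max_turns: int) -> List[str]:
    """Collect what the agent knows after each follow-up turn."""
    known = [sim.initial_query]
    for _ in range(max_turns):
        detail = sim.reply()  # the agent asks a follow-up or shows partial results
        if detail is None:
            break
        known.append(detail)  # a real agent would refine its search here
    return known


sim = ToyUserSimulator(
    initial_query="somewhere nice for a trip",
    hidden_details=["in Japan", "in autumn", "under a modest budget"],
)
known = run_dialogue(sim, max_turns=2)
print(known)
```

With a two-turn budget the agent never learns the budget constraint, which is the kind of gap that multi-turn evaluation is meant to expose.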

Key highlights

200 tasks split evenly between professional research (literature reviews, due diligence) and daily-life search (shopping, travel with evolving preferences)
Two reference agent implementations: GeneralAgent (OpenAI-compatible LLM with optional multi-agent mode) and OpenClaw wrapper (CLI-based agent)
Evaluation via LLM-as-judge: node alignment, then triplet semantic equivalence, with avg@N and best@N aggregation
Full pipeline scripts for inference + evaluation, or either separately
Dataset and paper available on Hugging Face
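The avg@N and best@N aggregation mentioned in the highlights can be sketched as follows. This assumes the common reading of those terms: each task is run N times, avg@N takes the mean score of the runs, best@N takes the maximum, and both are then averaged across tasks. The repository may define them differently; the function names and scores below are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def avg_at_n(run_scores: List[float]) -> float:
    """Mean score over the N runs of one task (assumed definition)."""
    return mean(run_scores)


def best_at_n(run_scores: List[float]) -> float:
    """Best score over the N runs of one task (assumed definition)."""
    return max(run_scores)


def aggregate(per_task_runs: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the per-task avg@N and best@N values across all tasks."""
    return {
        "avg@N": mean(avg_at_n(runs) for runs in per_task_runs.values()),
        "best@N": mean(best_at_n(runs) for runs in per_task_runs.values()),
    }


# Hypothetical triplet-F1 scores for two tasks, three runs each.
per_task_runs = {
    "task_a": [0.2, 0.4, 0.3],
    "task_b": [0.0, 0.1, 0.5],
}

summary = aggregate(per_task_runs)
print(summary)
```

Here avg@N is 0.25 and best@N is 0.45; the gap between the two gives a rough sense of how much run-to-run variance an agent has.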

Caveats

Requires multiple external services: Serper for search, a vLLM endpoint for the main model, another for summarization, and a Gemini endpoint for grading (or OpenAI-compatible alternatives)
The best reported F1 of 30.3 suggests that the benchmark is extremely hard, that current agents are inadequate, or both; the README does not analyze which
“OpenClaw” is referenced without explanation; it appears to be an external CLI tool you must already have running

Verdict

Worth studying if you are building search agents, evaluating LLM tool use, or skeptical that current systems handle real-world vague queries. Skip if you want a plug-and-play leaderboard entry: the setup is nontrivial and the best reported score is humbling.

Frequently asked

What is VibeBench/VibeSearchBench? VibeSearchBench tests whether agents can handle vague, evolving queries the way real humans actually ask them.
Is VibeSearchBench open source? Yes. VibeBench/VibeSearchBench is open source, released under the MIT license.
What language is VibeSearchBench written in? VibeBench/VibeSearchBench is primarily written in Python.
How popular is VibeSearchBench? VibeBench/VibeSearchBench has 915 stars on GitHub.
Where can I find VibeSearchBench? VibeBench/VibeSearchBench is on GitHub at https://github.com/VibeBench/VibeSearchBench.