beir-cellar/beir

One benchmark to stress-test them all: retrieval models meet 17 datasets

BEIR gives retrieval researchers a single Python framework to evaluate dense, sparse, lexical, and reranking models across diverse IR tasks without dataset wrangling.

Velocity (7d): +1.1 ★ / day · Trend: steady

What it does

BEIR is a Python toolkit that bundles 17 preprocessed information-retrieval datasets and a common evaluation harness. You bring a model—Sentence-BERT, a HuggingFace encoder, even a LoRA-tuned vLLM instance or a Cohere API call—and BEIR handles downloading corpora, running retrieval, and computing NDCG, MAP, Recall, Precision, and MRR at standard cutoffs. It is essentially glue code, but glue code that saves you from writing the same evaluation boilerplate for the seventeenth time.
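
The download, retrieve, and score flow described above can be sketched as follows. This is a minimal sketch modelled on the quickstart pattern in the BEIR README, not a definitive recipe: the dataset name `scifact`, the model name `msmarco-distilbert-base-tas-b`, the download URL pattern, and the helper names `dataset_url` and `evaluate_dense` are assumptions chosen for illustration, so check them against the README for your installed version.

```python
# URL pattern assumed from the BEIR README; verify it if a download fails.
BEIR_URL = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip"


def dataset_url(name):
    """Build the download URL for one BEIR dataset (hypothetical helper)."""
    return BEIR_URL.format(name)


def evaluate_dense(dataset="scifact",
                   model_name="msmarco-distilbert-base-tas-b",
                   out_dir="datasets"):
    """Download one dataset, run exact dense retrieval, and score it."""
    # Imported lazily so this file still loads when beir is not installed.
    from beir import util
    from beir.datasets.data_loader import GenericDataLoader
    from beir.retrieval import models
    from beir.retrieval.evaluation import EvaluateRetrieval
    from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

    # Fetch and unpack the corpus, queries, and relevance judgements.
    data_path = util.download_and_unzip(dataset_url(dataset), out_dir)
    corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

    # Wrap a Sentence-BERT encoder in brute-force dense search.
    model = DRES(models.SentenceBERT(model_name), batch_size=16)
    retriever = EvaluateRetrieval(model, score_function="dot")
    results = retriever.retrieve(corpus, queries)

    # One dict of scores per metric family, at each cutoff in k_values.
    ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
    return ndcg, _map, recall, precision
```

Calling `evaluate_dense()` downloads both the dataset and the model weights, so expect the first run to take a while.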

The interesting bit

The “heterogeneous” part is not marketing fluff. The datasets span scientific fact-checking, FAQ retrieval, biomedical search, and web passage ranking, so a model that aces one domain can still embarrass itself on another. BEIR exposes that variance deliberately, making it harder to cherry-pick a leaderboard win.
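
One way to keep that variance visible is to report the spread alongside the average instead of a single headline number. The sketch below is not part of BEIR; `summarize` is a hypothetical helper, and the scores are placeholder values for illustration only, not measured results.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Summarize per-dataset scores so weak domains stay visible."""
    if not scores:
        raise ValueError("scores must not be empty")
    worst = min(scores, key=scores.get)
    best = max(scores, key=scores.get)
    return {
        "mean": mean(scores.values()),
        "std": pstdev(scores.values()),
        "worst_dataset": worst,
        "worst_score": scores[worst],
        "best_dataset": best,
        "best_score": scores[best],
    }


# Placeholder numbers, not real benchmark results.
example = {"dataset_a": 0.70, "dataset_b": 0.30, "dataset_c": 0.50}
summary = summarize(example)
print(summary["worst_dataset"], summary["worst_score"])
```

A model with a respectable mean and a poor worst-case score is exactly the situation a multi-domain benchmark is meant to surface.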

Key highlights

  • 17 benchmark datasets ready to download and load via GenericDataLoader
  • Supports lexical, dense, sparse, and reranking architectures in one framework
  • Built-in metrics at cutoffs up to k = 1000: NDCG@k, MAP@k, Recall@k, Precision@k, and MRR
  • Wrappers for SBERT, HuggingFace transformers (with Flash Attention 2), vLLM with LoRA, and third-party APIs like Cohere
  • Python 3.9+, pip-installable, with Colab notebooks and a Hugging Face hub presence
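
To make the cutoff metrics in the list above concrete, here is a small pure-Python sketch of NDCG@k and Recall@k. It is not BEIR's own implementation. It assumes relevance judgements and results are nested dicts keyed by query ID and then document ID, and it uses linear gain with a log2 rank discount; the function names and the toy data are hypothetical.

```python
import math


def ndcg_at_k(qrels, results, k):
    """Mean NDCG@k over judged queries (linear gain, log2 discount)."""
    total = 0.0
    for qid, rels in qrels.items():
        doc_scores = results.get(qid, {})
        # Rank retrieved documents by descending score and keep the top k.
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
        dcg = sum(rels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked))
        # The ideal ranking places the highest relevance grades first.
        ideal = sorted(rels.values(), reverse=True)[:k]
        idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
        total += dcg / idcg if idcg > 0 else 0.0
    return total / len(qrels) if qrels else 0.0


def recall_at_k(qrels, results, k):
    """Mean fraction of relevant documents found in the top k."""
    total, counted = 0.0, 0
    for qid, rels in qrels.items():
        relevant = {d for d, r in rels.items() if r > 0}
        if not relevant:
            continue  # skip queries with no relevant documents
        doc_scores = results.get(qid, {})
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
        total += len(relevant.intersection(ranked)) / len(relevant)
        counted += 1
    return total / counted if counted else 0.0


# Toy data: one query, two relevant documents, one of them ranked last.
qrels = {"q1": {"d1": 1, "d3": 1}}
results = {"q1": {"d1": 0.9, "d2": 0.8, "d3": 0.1}}
print(recall_at_k(qrels, results, 2), recall_at_k(qrels, results, 3))
```

In the toy data the second relevant document only shows up at rank 3, so Recall@2 is half of Recall@3. That is why scores at several cutoffs tell you more than any single one.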

Caveats

  • The README is enthusiastic but thin on dataset documentation; you will need the wiki or the original papers to understand what each dataset actually measures
  • Some newer paths (vLLM, LoRA) require extra dependencies—peft, accelerate, vllm, faiss-cpu—that are not in the base install
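
Before reaching for the newer paths, it can help to check which of those extras are importable in your environment. The sketch below uses only the standard library. The module names are assumptions derived from the packages listed above (the `faiss-cpu` distribution is imported as `faiss`), and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names assumed from the package list; faiss-cpu installs as "faiss".
OPTIONAL_MODULES = ("peft", "accelerate", "vllm", "faiss")


def missing_modules(names=OPTIONAL_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All optional modules found.")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.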

Verdict

If you are building or comparing retrieval models and need a sanity check across domains, BEIR is the closest thing to a standard yardstick. If you only care about one narrow retrieval task, it is overkill—just use that task’s native scripts.
