Wikipedia, but make it vectors: phrase-level search at billion-scale
A retrieval system that indexes individual phrases from all of Wikipedia so you can search by semantic meaning, not keyword matching.

What it does
DensePhrases learns dense vector representations for phrases—billions of them, drawn from the entire Wikipedia—and retrieves them in real time for natural language queries. You can ask for phrases, sentences, passages, or whole documents, and it returns semantically relevant matches without relying on traditional keyword indexing.
The interesting bit
Most retrieval systems index documents or passages. This one indexes phrases—the atomic unit of meaning—and still scales to billions of vectors. The same model can also retrieve coarser-grained units (passages, documents) by learning multi-granularity representations, which is the focus of their follow-up EMNLP 2021 paper.
Key highlights
- Pre-trained models on Hugging Face; load with a few lines of Python
- 74GB phrase index for full Wikipedia (2018.12.20); smaller 20–60GB indexes available
- Query-side fine-tuning adapts the retriever to specific QA datasets (NQ, WebQ, TriviaQA, etc.)
- Supports downstream tasks: open-domain QA, entity linking, slot filling, knowledge-grounded dialogue
- Online demo runs at densephrases.korea.ac.kr
Caveats
mainbranch pins totransformers==2.9.0(Python 3.7); v1.1.0 upgrades totransformers==4.13.0but requires branch switching- Full phrase index is 74GB uncompressed; you’ll need serious disk and RAM
- Installation requires NVIDIA apex and CUDA-aware PyTorch—no quick pip install
Verdict
Worth a look if you’re building retrieval-augmented generation or open-domain QA and need sub-passage precision. Skip it if you’re after lightweight, plug-and-play search; this is research infrastructure with the hardware demands to match.