Is DensePhrases open source?

Yes — princeton-nlp/DensePhrases is open source, released under the Apache-2.0 license.

What language is DensePhrases written in?

princeton-nlp/DensePhrases is primarily written in Python.

How popular is DensePhrases?

princeton-nlp/DensePhrases has 607 stars on GitHub.

Where can I find DensePhrases?

princeton-nlp/DensePhrases is on GitHub at https://github.com/princeton-nlp/DensePhrases.


princeton-nlp/DensePhrases

Wikipedia, but make it vectors: phrase-level search at billion-scale

A retrieval system that indexes individual phrases from all of Wikipedia so you can search by semantic meaning, not keyword matching.

★ 607 stars · Python · RAG · Search · Language Models


What it does

DensePhrases learns dense vector representations for phrases—billions of them, drawn from the entire Wikipedia—and retrieves them in real time for natural language queries. You can ask for phrases, sentences, passages, or whole documents, and it returns semantically relevant matches without relying on traditional keyword indexing.
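To make the idea concrete, here is a minimal sketch of phrase-level indexing and search. Everything in it is an assumption for illustration: `embed` hashes character trigrams as a stand-in for DensePhrases' learned encoder, the index is a plain Python list searched by brute force, and the names `embed`, `enumerate_phrases`, `build_index`, and `search` are hypothetical, not the project's API.

```python
import hashlib
import math

DIM = 64  # toy vector size (assumption; unrelated to the real model)


def embed(text):
    """Stand-in encoder: hash character trigrams into a unit-length vector."""
    vec = [0.0] * DIM
    padded = f" {text.lower()} "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def enumerate_phrases(sentence, max_len=3):
    """Yield every contiguous word span of up to max_len words."""
    words = sentence.split()
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            yield " ".join(words[start:end])


def build_index(sentences, max_len=3):
    """Index each phrase as (vector, phrase text, id of its source sentence)."""
    index = []
    for sent_id, sentence in enumerate(sentences):
        for phrase in enumerate_phrases(sentence, max_len):
            index.append((embed(phrase), phrase, sent_id))
    return index


def search(index, query, top_k=3):
    """Brute-force inner-product search over all phrase vectors."""
    q = embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), phrase, sent_id)
        for vec, phrase, sent_id in index
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


sentences = [
    "Barack Obama won the Nobel Peace Prize in 2009",
    "Paris is the capital of France",
]
index = build_index(sentences)
for score, phrase, sent_id in search(index, "nobel peace prize"):
    print(f"{score:.3f}  {phrase!r}  (from sentence {sent_id})")
```

Because each hit keeps the ID of the sentence it came from, the same lookup can hand back the phrase alone or its surrounding context. A system at real scale would swap the brute-force loop for an approximate nearest-neighbor index.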

The interesting bit

Most retrieval systems index documents or passages. This one indexes phrases, the atomic unit of meaning, and still scales to billions of vectors. The same model can also retrieve coarser-grained units (passages, documents) by learning multi-granularity representations, which is the focus of the authors' follow-up EMNLP 2021 paper.
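One simple way to get from phrase hits to passages or documents is to score each coarser unit by its best-scoring phrase. The sketch below assumes that max-aggregation rule and uses made-up scores and IDs. The function name `aggregate` and the dictionary fields are hypothetical, and this is an illustration rather than the repository's implementation.

```python
def aggregate(phrase_hits, unit="passage", top_k=2):
    """Rank passages or documents by the best phrase score found inside each."""
    key = {"passage": "passage_id", "document": "doc_id"}[unit]
    best = {}
    for hit in phrase_hits:
        unit_id = hit[key]
        # Keep only the highest-scoring phrase seen so far for this unit.
        if unit_id not in best or hit["score"] > best[unit_id]["score"]:
            best[unit_id] = hit
    ranked = sorted(best.items(), key=lambda kv: kv[1]["score"], reverse=True)
    return [(unit_id, hit["score"], hit["phrase"]) for unit_id, hit in ranked[:top_k]]


# Illustrative phrase-level hits; scores and IDs are invented for this example.
phrase_hits = [
    {"phrase": "Barack Obama", "score": 0.91, "passage_id": "p1", "doc_id": "d1"},
    {"phrase": "2009", "score": 0.64, "passage_id": "p1", "doc_id": "d1"},
    {"phrase": "Oslo", "score": 0.72, "passage_id": "p2", "doc_id": "d1"},
    {"phrase": "Alfred Nobel", "score": 0.80, "passage_id": "p3", "doc_id": "d2"},
]

print(aggregate(phrase_hits, unit="passage"))
print(aggregate(phrase_hits, unit="document"))
```

Keeping the winning phrase alongside each passage or document means the coarse result still points at the exact span that justified it.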

Key highlights

Pre-trained models on Hugging Face; load with a few lines of Python
74GB phrase index for full Wikipedia (2018.12.20); smaller 20–60GB indexes available
Query-side fine-tuning adapts the retriever to specific QA datasets (NQ, WebQ, TriviaQA, etc.)
Supports downstream tasks: open-domain QA, entity linking, slot filling, knowledge-grounded dialogue
Online demo runs at densephrases.korea.ac.kr

Caveats

main branch pins to transformers==2.9.0 (Python 3.7); v1.1.0 upgrades to transformers==4.13.0 but requires branch switching
Full phrase index is 74GB uncompressed; you’ll need serious disk and RAM
Installation requires NVIDIA apex and CUDA-aware PyTorch—no quick pip install
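Given the size of the full index, it can be worth checking free disk space before starting a download. The helper below is a small sketch using only the Python standard library. The function name `has_room` is hypothetical, and it assumes the 74GB figure means 74 × 10^9 bytes.

```python
import shutil


def has_room(path, required_gb=74):
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


if has_room("."):
    print("Enough free space for the full phrase index.")
else:
    print("Not enough free space; consider one of the smaller indexes.")
```

This only covers disk. RAM needs depend on how the index is loaded and are not checked here.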

Verdict

Worth a look if you’re building retrieval-augmented generation or open-domain QA and need sub-passage precision. Skip it if you’re after lightweight, plug-and-play search; this is research infrastructure with the hardware demands to match.
