Baidu's dense-retrieval toolkit ships ready-to-run SOTA models
A Python wrapper around ERNIE-based encoders that gets you from pip install to a working neural search engine in a few lines, with the first open-source Chinese dense retrieval model included.

What it does
RocketQA is a Python toolkit for dense passage retrieval and question answering. It wraps pre-trained dual-encoder and cross-encoder models (built on Baidu’s ERNIE architecture) behind a simple API: load a model, encode queries and paragraphs, get similarity scores. The package also bundles ready-made examples for Jina and Faiss, so you can stand up an end-to-end search service without writing much glue code.
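The load, encode, score flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's reference code: the model name "v1_marco_de" and the keyword arguments to load_model(), encode_query() and encode_para() are taken from memory of the RocketQA README and should be checked against available_models() and the current docs. The helper names dot, score_pairs and load_dual_encoder are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(float(a) * float(b) for a, b in zip(u, v))


def score_pairs(encoder, queries: List[str], titles: List[str],
                paras: List[str]) -> List[float]:
    """Score query i against paragraph i with a dual encoder.

    Assumption: `encoder` exposes encode_query(query=...) and
    encode_para(para=..., title=...), each yielding one vector per input.
    """
    q_embs = list(encoder.encode_query(query=queries))
    p_embs = list(encoder.encode_para(para=paras, title=titles))
    return [dot(q, p) for q, p in zip(q_embs, p_embs)]


def load_dual_encoder(name: str = "v1_marco_de"):
    """Load a pre-trained dual encoder (needs rocketqa and PaddlePaddle).

    The import is lazy so the helpers above stay usable without the package.
    The model name and keyword arguments are assumptions; verify them first.
    """
    import rocketqa
    return rocketqa.load_model(model=name, use_cuda=False,
                               device_id=0, batch_size=16)
```

With the package installed, score_pairs(load_dual_encoder(), queries, titles, paras) would return one similarity score per query-paragraph pair; any object with the same two methods can stand in for the encoder during testing.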
The interesting bit
The project claims the first open-source Chinese dense retrieval model, trained on millions of manually annotated query-passage pairs from Baidu’s DuReader dataset. That’s notable because most open retrieval toolkits have historically been heavily English-centric. The README also tracks a steady research lineage: RocketQA v1 (NAACL 2021), PAIR (ACL 2021), RocketQA v2 (EMNLP 2021), and DuReader_retrieval (EMNLP 2022). That suggests the toolkit is kept in sync with the group’s published work rather than drifting from it.
Key highlights
- Ships with pre-trained models for both Chinese and English retrieval; available_models() lists what’s on offer.
- Dual encoders for vector search, cross encoders for re-ranking: a standard architecture, but exposed through a uniform load_model() interface.
- Jina integration example lets you index documents and query them via CLI with minimal boilerplate.
- Training API supports fine-tuning on custom TSV data (query, title, paragraph, label) without leaving the library.
- Docker image available if you prefer not to wrestle with PaddlePaddle GPU dependencies locally.
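The dual-encoder plus cross-encoder split mentioned in the highlights is the usual retrieve-then-rerank pattern. The sketch below shows it in plain Python with toy data, independent of RocketQA. All names (top_k, rerank, retrieve_then_rerank, pair_score) are hypothetical. In a real setup the vectors would come from a dual encoder, the brute-force scan would likely be replaced by a Faiss index, and pair_score would be backed by a cross encoder.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def top_k(query_vec: Vector, para_vecs: List[Vector], k: int) -> List[int]:
    """Stage 1: brute-force inner-product search over paragraph vectors."""
    scores = [sum(a * b for a, b in zip(query_vec, p)) for p in para_vecs]
    order = sorted(range(len(para_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def rerank(query: str, candidates: List[int], paras: List[str],
           pair_score: Callable[[str, str], float]) -> List[Tuple[int, float]]:
    """Stage 2: rescore each candidate with a (slower) pairwise scorer."""
    scored = [(i, pair_score(query, paras[i])) for i in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def retrieve_then_rerank(query: str, query_vec: Vector, paras: List[str],
                         para_vecs: List[Vector],
                         pair_score: Callable[[str, str], float],
                         k: int = 2) -> List[Tuple[int, float]]:
    """Run cheap vector retrieval first, then rerank only the top k hits."""
    return rerank(query, top_k(query_vec, para_vecs, k), paras, pair_score)
```

The point of the split is cost: the vector stage touches every paragraph but only does a dot product, while the pairwise scorer reads the full query and paragraph text and so runs only on the k survivors.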
Caveats
- Hard dependency on PaddlePaddle 2.0+ and Python 3.6+; no PyTorch or TensorFlow escape hatch.
- The “SOTA” claims are backed by the group’s own papers, but no third-party leaderboard links or independent benchmark numbers appear in the README.
- Last meaningful toolkit update appears to be April 2022 (training functions added); the research papers continue, but it’s unclear how actively the package itself is maintained.
Verdict
Worth a look if you need Chinese-language dense retrieval out of the box or want to prototype a neural search pipeline quickly without tuning transformer architectures yourself. Skip it if you’re already invested in PyTorch ecosystems like Hugging Face or need fine-grained control over model internals.