PaddlePaddle/RocketQA

Baidu's dense-retrieval toolkit ships ready-to-run SOTA models

A Python wrapper around ERNIE-based encoders that gets you from pip install to a working neural search engine in a few lines, and includes what the project describes as the first open-source Chinese dense retrieval model.


What it does

RocketQA is a Python toolkit for dense passage retrieval and question answering. It wraps pre-trained dual-encoder and cross-encoder models (built on Baidu’s ERNIE architecture) behind a simple API: load a model, encode queries and paragraphs, get similarity scores. The package also bundles ready-made examples for Jina and Faiss, so you can stand up an end-to-end search service without writing much glue code.
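The encode-then-score flow described above can be sketched without the library itself. The snippet below is a toy illustration, not RocketQA's API: toy_encode is a hypothetical stand-in that hashes words into a fixed-length vector where the real toolkit would run an ERNIE-based encoder, and rank is likewise an invented helper. It only shows the shape of dual-encoder scoring, where queries and paragraphs are embedded independently and compared by dot product.

```python
import math
import zlib
from collections import Counter

DIM = 4096  # arbitrary toy dimension; real encoders output learned embeddings


def toy_encode(text):
    """Hypothetical stand-in encoder: hash each word into a bucket, then L2-normalise."""
    vec = [0.0] * DIM
    for token, count in Counter(text.lower().split()).items():
        # crc32 is deterministic across runs, unlike the built-in hash() for strings
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank(query, paragraphs, top_k=2):
    """Score every paragraph against the query and return the top_k (paragraph, score) pairs."""
    q_vec = toy_encode(query)
    p_vecs = [toy_encode(p) for p in paragraphs]
    scored = sorted(((dot(q_vec, pv), i) for i, pv in enumerate(p_vecs)), reverse=True)
    return [(paragraphs[i], score) for score, i in scored[:top_k]]


paragraphs = [
    "dense retrieval encodes queries and passages as vectors",
    "the cafeteria serves lunch at noon",
    "cross encoders score a query and passage jointly",
]
results = rank("how does dense retrieval encode passages", paragraphs)
for paragraph, score in results:
    print(f"{score:.3f}  {paragraph}")
```

Because paragraph vectors do not depend on the query, they can be computed once and stored in an index such as Faiss; only the query needs encoding at search time.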

The interesting bit

The project claims the first open-source Chinese dense retrieval model, trained on millions of manually annotated query-passage pairs from Baidu’s DuReader dataset. That is notable because most open retrieval toolkits have historically been English-centric. The README also tracks a steady research lineage: RocketQA v1 (NAACL 2021), PAIR (ACL 2021), RocketQA v2 (EMNLP 2021), and DuReader_retrieval (EMNLP 2022). That suggests the toolkit is kept in sync with the group’s published work rather than drifting from it.

Key highlights

  • Ships with pre-trained models for both Chinese and English retrieval; available_models() lists what’s on offer.
  • Dual encoders for vector search and cross encoders for re-ranking: a standard architecture, but exposed through a uniform load_model() interface.
  • Jina integration example lets you index documents and query them via CLI with minimal boilerplate.
  • Training API supports fine-tuning on custom TSV data (query, title, paragraph, label) without leaving the library.
  • Docker image available if you prefer not to wrestle with PaddlePaddle GPU dependencies locally.
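For the fine-tuning bullet, checking the data file before training can save a failed run. The sketch below assumes one example per line with four tab-separated fields in the order listed above (query, title, paragraph, label) and an integer label. The exact label values and any header handling the library expects are not stated here, so treat read_training_tsv and TrainingRow as hypothetical helpers rather than part of RocketQA.

```python
import csv
import io
from typing import NamedTuple


class TrainingRow(NamedTuple):
    query: str
    title: str
    paragraph: str
    label: int  # assumption: integer relevance label


def read_training_tsv(handle):
    """Parse tab-separated training rows and fail early on malformed lines."""
    rows = []
    # QUOTE_NONE keeps quote characters in the text as literal characters
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, fields in enumerate(reader, start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(
                f"line {line_no}: expected 4 tab-separated fields, got {len(fields)}"
            )
        query, title, paragraph, label = fields
        rows.append(TrainingRow(query, title, paragraph, int(label)))
    return rows


# Made-up sample data, for illustration only
sample = io.StringIO(
    "what is dense retrieval\tRetrieval\tDense retrieval maps text to vectors.\t1\n"
    "what is dense retrieval\tLunch\tThe cafeteria opens at noon.\t0\n"
)
rows = read_training_tsv(sample)
print(len(rows), "rows parsed")
```

For a real file, the same function can be given an open file handle, for example one opened with encoding="utf-8" and newline="".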

Caveats

  • Hard dependency on PaddlePaddle 2.0+ and Python 3.6+; no PyTorch or TensorFlow escape hatch.
  • The “SOTA” claims are backed by the group’s own papers, but no third-party leaderboard links or independent benchmark numbers appear in the README.
  • Last meaningful toolkit update appears to be April 2022 (training functions added); the research papers continue, but it’s unclear how actively the package itself is maintained.

Verdict

Worth a look if you need Chinese-language dense retrieval out of the box or want to prototype a neural search pipeline quickly without tuning transformer architectures yourself. Skip it if you’re already invested in PyTorch ecosystems like Hugging Face or need fine-grained control over model internals.
