Is RocketQA open source?

Yes — PaddlePaddle/RocketQA is open source, released under the Apache-2.0 license.

What language is RocketQA written in?

PaddlePaddle/RocketQA is primarily written in Python.

How popular is RocketQA?

PaddlePaddle/RocketQA has 785 stars on GitHub.

Where can I find RocketQA?

PaddlePaddle/RocketQA is on GitHub at https://github.com/PaddlePaddle/RocketQA.

PaddlePaddle/RocketQA

Baidu's dense-retrieval toolkit ships ready-to-run SOTA models

A Python wrapper around ERNIE-based encoders that gets you from pip install to a working neural search engine in a few lines, with the first open-source Chinese dense retrieval model included.

★ 785 stars · Python · RAG · Search · Language Models · Data Tooling

What it does

RocketQA is a Python toolkit for dense passage retrieval and question answering. It wraps pre-trained dual-encoder and cross-encoder models (built on Baidu’s ERNIE architecture) behind a simple API: load a model, encode queries and paragraphs, get similarity scores. The package also bundles ready-made examples for Jina and Faiss, so you can stand up an end-to-end search service without writing much glue code.
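
The encode-then-score flow described above can be illustrated without the library itself. What follows is a toy sketch, not RocketQA's API: `build_vocab`, `encode`, `score`, and `rank` are hypothetical names, and a normalised word-count vector stands in for the ERNIE-based encoders. It only shows the shape of the idea: queries and paragraphs are encoded independently, and a dot product gives the similarity score.

```python
import math


def build_vocab(texts):
    """Assign an index to every distinct lowercase token (toy vocabulary)."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Toy stand-in for a neural encoder: L2-normalised word counts."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        idx = vocab.get(token)
        if idx is not None:  # out-of-vocabulary tokens are ignored
            vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def score(q_vec, p_vec):
    """Dot product of two vectors; the similarity a dual encoder would report."""
    return sum(q * p for q, p in zip(q_vec, p_vec))


def rank(query, paragraphs):
    """Encode query and paragraphs separately, then sort by similarity."""
    vocab = build_vocab(paragraphs)
    q_vec = encode(query, vocab)
    scored = [(score(q_vec, encode(p, vocab)), p) for p in paragraphs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


paragraphs = [
    "the trigeminal nerve carries sensation from the face",
    "paris is the capital of france",
    "faiss builds indexes for vector search",
]
results = rank("what is the capital of france", paragraphs)
for similarity, paragraph in results:
    print(f"{similarity:.3f}  {paragraph}")
```

In a real deployment the paragraph vectors would be computed once and stored in an index, so only the query needs encoding at search time.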

The interesting bit

The project claims to offer the first open-source Chinese dense retrieval model, trained on millions of manually annotated query-passage pairs from Baidu’s DuReader dataset. That’s notable because most open retrieval toolkits have historically been English-centric. The README also tracks a steady research lineage: RocketQA v1 (NAACL 2021), PAIR (ACL 2021), RocketQA v2 (EMNLP 2021), and DuReader_retrieval (EMNLP 2022). That suggests the toolkit is kept in sync with the group’s published work rather than drifting from it.

Key highlights

Ships with pre-trained models for both Chinese and English retrieval; available_models() lists what’s on offer.
Dual encoders for vector search, cross encoders for re-ranking—standard architecture, but exposed through a uniform load_model() interface.
Jina integration example lets you index documents and query them via CLI with minimal boilerplate.
Training API supports fine-tuning on custom TSV data (query, title, paragraph, label) without leaving the library.
Docker image available if you prefer not to wrestle with PaddlePaddle GPU dependencies locally.
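
The TSV layout mentioned in the highlights above (query, title, paragraph, label) can be prepared and sanity-checked with a few lines of standard-library Python. This is a minimal sketch under assumptions: the column order follows the listing above, labels are taken to be integers such as 0 and 1, and `format_row` and `parse_row` are hypothetical helpers, not part of RocketQA. The exact format the training API expects should be checked against the README.

```python
import io

# Assumed column order, taken from the description above.
COLUMNS = ("query", "title", "paragraph", "label")


def format_row(query, title, paragraph, label):
    """Join one training example into a tab-separated line."""
    fields = [query, title, paragraph, str(label)]
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


def parse_row(line):
    """Split a line back into a record, checking the field count."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    record["label"] = int(record["label"])  # assumes integer labels
    return record


rows = [
    ("capital of france", "France", "Paris is the capital of France.", 1),
    ("capital of france", "Faiss", "Faiss is a library for vector search.", 0),
]

# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
for row in rows:
    buffer.write(format_row(*row) + "\n")

parsed = [parse_row(line) for line in buffer.getvalue().splitlines()]
for record in parsed:
    print(record)
```

Validating the field count before training catches stray tabs inside paragraphs, a common source of silently misaligned columns.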

Caveats

Hard dependency on PaddlePaddle 2.0+ and Python 3.6+; no PyTorch or TensorFlow escape hatch.
The “SOTA” claims are backed by the group’s own papers, but no third-party leaderboard links or independent benchmark numbers appear in the README.
The last meaningful toolkit update appears to date from April 2022 (training functions added); the group has kept publishing research since, but it’s unclear how actively the package itself is maintained.

Verdict

Worth a look if you need Chinese-language dense retrieval out of the box or want to prototype a neural search pipeline quickly without tuning transformer architectures yourself. Skip it if you’re already invested in PyTorch ecosystems like Hugging Face or need fine-grained control over model internals.
