A retired BERT pipeline that still teaches
cdQA was a Python toolkit for building closed-domain QA systems on your own documents, before its authors sent everyone to Haystack instead.

What it does
cdQA is an end-to-end question-answering pipeline that lets you point BERT (or DistilBERT) at your own document collection, as PDF or Markdown files, and ask it natural-language questions. It handles the full loop: converting documents into a pandas DataFrame, retrieving relevant paragraphs, and running a pre-trained reader to extract answers. There’s also a Flask API and a companion UI project if you want to wrap it in a web interface.
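To make the retrieval step concrete, here is a minimal standard-library sketch of ranking paragraphs against a question. It is not cdQA's code: the text above does not say which retrieval method cdQA uses, so TF-IDF with cosine similarity is an assumption, and `tokenize` and `rank_paragraphs` are hypothetical names. A plain list of strings stands in for the DataFrame of paragraphs.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_paragraphs(question, paragraphs, top_n=2):
    """Return up to top_n (paragraph_index, score) pairs, best match first.

    Illustrative TF-IDF + cosine similarity; an assumed stand-in for a
    retriever stage, not the library's implementation.
    """
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n_docs = len(docs)
    # Document frequency: in how many paragraphs each term appears.
    doc_freq = Counter(term for doc in docs for term in doc)

    def idf(term):
        # Smoothed inverse document frequency, always positive.
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

    def weigh(counts):
        return {term: count * idf(term) for term, count in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(term, 0.0) for term, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = weigh(Counter(tokenize(question)))
    scored = [(cosine(query_vec, weigh(doc)), i) for i, doc in enumerate(docs)]
    # Highest score first; ties fall back to original paragraph order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:top_n]]


paragraphs = [
    "The retriever selects candidate paragraphs with TF-IDF.",
    "The reader extracts an answer span with BERT.",
    "Flask serves the REST API.",
]
ranking = rank_paragraphs("Which component extracts the answer span?", paragraphs)
```

In a full pipeline, only the top-ranked paragraphs would be handed to the reader, which keeps the expensive BERT pass off most of the corpus.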
The interesting bit
The pipeline explicitly splits the problem into a retriever stage and a reader stage, then blends their scores with a tunable weight. That’s not exotic now, but the project arrived early enough that its Medium article and NLP Breakfast talk became reference material for people learning how BERT-based QA actually works.
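The score blending can be sketched in a few lines. This is an illustration under assumptions, not cdQA's implementation: the text only says the two scores are blended with a tunable weight, so the linear interpolation, the default of 0.5, and the names `blend_scores` and `best_answer` are all hypothetical. It also assumes both scores have already been normalized to comparable ranges.

```python
def blend_scores(retriever_score, reader_score, retriever_weight=0.5):
    """Linearly interpolate between retriever and reader scores.

    retriever_weight=0.0 trusts only the reader; 1.0 trusts only the
    retriever. The linear form is an assumption made for illustration.
    """
    if not 0.0 <= retriever_weight <= 1.0:
        raise ValueError("retriever_weight must be between 0 and 1")
    return (retriever_weight * retriever_score
            + (1.0 - retriever_weight) * reader_score)


def best_answer(candidates, retriever_weight=0.5):
    """Pick the candidate with the highest blended score.

    Each candidate is a dict with hypothetical keys 'answer',
    'retriever_score', and 'reader_score'.
    """
    return max(
        candidates,
        key=lambda c: blend_scores(
            c["retriever_score"], c["reader_score"], retriever_weight
        ),
    )


candidates = [
    {"answer": "A", "retriever_score": 0.9, "reader_score": 0.2},
    {"answer": "B", "retriever_score": 0.3, "reader_score": 0.7},
]
balanced = best_answer(candidates, retriever_weight=0.5)
reader_heavy = best_answer(candidates, retriever_weight=0.2)
```

With these made-up numbers, an even weighting favors the paragraph the retriever liked, while shifting weight toward the reader flips the choice. That sensitivity is why the weight is worth tuning on your own documents.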
Key highlights
- Built on Hugging Face transformers, with ready-to-use BERT and DistilBERT readers fine-tuned on SQuAD 1.1
- Includes converters for PDF and Markdown; PDF parsing requires Java (OpenJDK) to be installed
- Supports custom fine-tuning on SQuAD-like annotated data via a separate web annotator tool
- Provides notebook tutorials runnable on Binder or Google Colab
- Ships with a lightweight Flask API for deployment
Caveats
- Not maintained. The README banner points users to Haystack as the actively supported alternative
- Converter support is limited to PDF and Markdown; the README’s “plan to add more” never materialized
- GPU experiments were run on a single Tesla V100; no explicit guidance on whether smaller hardware is viable
Verdict
Worth a look if you’re studying how retriever-reader QA pipelines work and want a clean, educational codebase to dissect. Skip it for production use; the authors themselves redirect you to Haystack.