cdqa-suite/cdQA

A retired BERT pipeline that still teaches

cdQA was a Python toolkit for building closed-domain QA systems on your own documents, before its authors sent everyone to Haystack instead.

617 stars · Python · Language Models · RAG · Search
Star velocity (7d): +0.2 ★ / day · Trend: steady

What it does

cdQA is an end-to-end question-answering pipeline that lets you point BERT (or DistilBERT) at your own document collection (PDF or Markdown files) and ask it natural-language questions. It handles the full loop: converting documents into a pandas DataFrame, retrieving relevant paragraphs, and running a pre-trained reader to extract answers. There’s also a Flask API and a companion UI project if you want to wrap it in a web interface.
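To make the retrieval step concrete, here is a minimal standard-library sketch of ranking paragraphs against a question. It is not cdQA's code: TF-IDF with cosine similarity is used as a stand-in ranking method, and every name in it (`tokenize`, `build_index`, `retrieve`, the sample paragraphs) is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unseen at indexing time are dropped."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def build_index(paragraphs):
    """Compute smoothed IDF weights and one TF-IDF vector per paragraph."""
    tokenized = [tokenize(p) for p in paragraphs]
    n = len(paragraphs)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return idf, [tfidf_vector(tokens, idf) for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, paragraphs, top_k=2):
    """Return the top_k (paragraph, score) pairs for the question."""
    idf, vectors = build_index(paragraphs)
    q = tfidf_vector(tokenize(question), idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(paragraphs[i], score) for score, i in scored[:top_k]]


# Made-up paragraphs standing in for rows of the document DataFrame.
paragraphs = [
    "The reader extracts answer spans from a paragraph.",
    "PDF files are converted with a Java-based parser.",
    "The Flask API exposes a single prediction endpoint.",
]
hits = retrieve("How are PDF files converted?", paragraphs, top_k=1)
```

In a real pipeline, the paragraphs returned here would be handed to the BERT reader, which extracts an answer span from each one.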

The interesting bit

The pipeline explicitly splits the problem into a retriever stage and a reader stage, then blends their scores with a tunable weight. That’s not exotic now, but the project arrived early enough that its Medium article and NLP Breakfast talk became reference material for people learning how BERT-based QA actually works.
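The score blending can be sketched as a simple linear interpolation. This is an assumption for illustration only: the exact formula and any score normalization cdQA applies are not described here, and `Candidate`, `blend`, `best_answer`, and the numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate answer span (hypothetical structure, not cdQA's)."""
    answer: str
    retriever_score: float  # how well the paragraph matched the question
    reader_score: float     # how confident the reader is in the span


def blend(candidate, retriever_weight):
    """Assumed blend: linear interpolation between the two stage scores."""
    if not 0.0 <= retriever_weight <= 1.0:
        raise ValueError("retriever_weight must be in [0, 1]")
    return (retriever_weight * candidate.retriever_score
            + (1.0 - retriever_weight) * candidate.reader_score)


def best_answer(candidates, retriever_weight):
    """Pick the candidate with the highest blended score."""
    return max(candidates, key=lambda c: blend(c, retriever_weight))


# Made-up scores: one span from a well-matched paragraph, one the reader prefers.
candidates = [
    Candidate("in 2019", retriever_score=0.9, reader_score=0.4),
    Candidate("in 2017", retriever_score=0.3, reader_score=0.8),
]

reader_leaning = best_answer(candidates, retriever_weight=0.35)
retriever_leaning = best_answer(candidates, retriever_weight=0.8)
```

With these made-up scores, a low weight favors the span the reader is most confident in, while a high weight favors the span from the best-matching paragraph. That trade-off is what makes the weight worth tuning.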

Key highlights

  • Built on HuggingFace transformers, with ready-to-use BERT and DistilBERT readers fine-tuned on SQuAD 1.1
  • Includes converters for PDF and Markdown; needs Java OpenJDK for PDF parsing
  • Supports custom fine-tuning on SQuAD-like annotated data via a separate web annotator tool
  • Provides notebook tutorials runnable on Binder or Google Colab
  • Ships with a lightweight Flask API for deployment

Caveats

  • Not maintained. The README banner points users to Haystack as the actively supported alternative
  • Converter support is limited to PDF and Markdown; the README’s “plan to add more” never materialized
  • GPU experiments were run on a single Tesla V100; no explicit guidance on whether smaller hardware is viable

Verdict

Worth a look if you’re studying how retriever-reader QA pipelines work and want a clean, educational codebase to dissect. Skip it for production use; the authors themselves redirect you to Haystack.
