namuan/dr-doc-search

Chat with your PDFs, but bring your own OCR

A CLI tool that turns static PDFs into conversational search targets using GPT-3 or local HuggingFace models.

596 stars · Python · RAG · Search · Language Models
Velocity (7d): +0.5 ★/day · Trend: steady

What it does

Dr-doc-search ingests a PDF, rips it into page images, runs Tesseract OCR over them, builds a vector index, and exposes either a CLI Q&A mode or a local web UI (port 5006) where you can ask natural-language questions about the document’s contents. It started as an OpenAI-only tool; since v1.5.0 you can swap in HuggingFace embeddings and LLMs to keep your documents and your money local.

The interesting bit

The pipeline is deliberately low-tech: PDF → image → OCR → text chunks → embeddings. That makes it work on scanned books and image-heavy PDFs where pure text extraction fails, though it also means you’re one ImageMagick install away from dependency hell. The web UI is built with HoloViz Panel, which is an unusual but pragmatic choice for a solo dev tool.
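
The later stages of that pipeline (text chunks → embeddings → lookup) can be sketched in a few lines. This is a toy illustration, not the project's code: it assumes the OCR step has already produced one string per page, and it swaps in a bag-of-words counter with cosine similarity where the real tool would call OpenAI or HuggingFace embeddings. All function names here (`chunk_pages`, `embed`, `build_index`, `search`) are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_pages(pages, max_words=50):
    """Split per-page OCR text into word-bounded chunks, keeping the page number."""
    chunks = []
    for page_no, text in enumerate(pages, start=1):
        words = text.split()
        for i in range(0, len(words), max_words):
            chunks.append((page_no, " ".join(words[i:i + max_words])))
    return chunks


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(pages, max_words=50):
    """Index = list of (page number, chunk text, chunk vector)."""
    return [(n, c, embed(c)) for n, c in chunk_pages(pages, max_words)]


def search(index, question, top_k=1):
    """Return the top_k (page number, chunk) pairs most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(n, c) for n, c, _ in ranked[:top_k]]


# Pretend these strings came out of the OCR step, one per page.
pages = [
    "The warranty covers parts and labour for two years.",
    "To reset the device hold the power button for ten seconds.",
]
index = build_index(pages)
best_page, best_chunk = search(index, "How do I reset the device?")[0]
print(best_page, best_chunk)
```

In a real setup the retrieved chunks would be handed to a language model as context for the answer; the sketch stops at retrieval because that is the part the pipeline description covers.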

Key highlights

  • Supports both OpenAI (GPT-3) and local HuggingFace models for embeddings and answers
  • Web interface and CLI modes; page-range filtering for large documents
  • Outputs working files (images, OCR text, index) to ~/OutputDir/dr-doc-search/<pdf-name> for inspection or debugging
  • PyPI installable; automated release pipeline via Poetry and GitHub Actions
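
For poking at those working files, a small helper along these lines may be handy. It is a sketch under one loud assumption: that `<pdf-name>` is the PDF's file name without its extension. The summary does not say how the folder name is derived, and the layout inside the folder is not documented here either, so the helper simply lists whatever it finds. Both function names are hypothetical.

```python
from pathlib import Path


def working_dir(pdf_path, home=None):
    """Guess the working directory for a PDF.

    Assumption: <pdf-name> is the file stem (name without ".pdf").
    """
    base = Path(home) if home is not None else Path.home()
    return base / "OutputDir" / "dr-doc-search" / Path(pdf_path).stem


def list_working_files(pdf_path, home=None):
    """List every file under the working directory, relative to it."""
    root = working_dir(pdf_path, home)
    if not root.is_dir():
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


print(working_dir("scans/manual.pdf"))
print(list_working_files("scans/manual.pdf"))
```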

Caveats

  • Requires manual installation of Tesseract OCR and ImageMagick; Windows users must set an IMCONV environment variable
  • The README notes that OpenAI API costs apply after the trial period, but doesn’t quantify typical indexing or query costs
  • No mention of concurrent users, rate limiting, or how the web UI behaves with large documents
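
Given the manual prerequisites, a quick pre-flight check can save a failed first run. The sketch below is not part of dr-doc-search; it only looks for the usual executable names (`tesseract`, and `magick` or `convert` for ImageMagick) on PATH and, on Windows, checks that `IMCONV` is set. The function name is hypothetical, and what value `IMCONV` should hold is left to the project's README.

```python
import os
import shutil
import sys


def missing_prerequisites(which=shutil.which, environ=os.environ,
                          platform=sys.platform):
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    if which("tesseract") is None:
        problems.append("tesseract not found on PATH")
    if platform.startswith("win"):
        # Windows ships an unrelated convert.exe, so a PATH lookup for
        # "convert" proves nothing there; check the IMCONV variable instead.
        if not environ.get("IMCONV"):
            problems.append("IMCONV environment variable is not set")
    elif which("magick") is None and which("convert") is None:
        problems.append("ImageMagick (magick/convert) not found on PATH")
    return problems


for problem in missing_prerequisites():
    print("missing:", problem)
```

The lookup function, environment, and platform are parameters only so the check can be exercised without touching the real system.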

Verdict

Worth a spin if you have a shelf of scanned PDFs and want to query them without uploading to a cloud service—provided you’re willing to wrangle OCR dependencies. Skip it if your PDFs are already text-native; simpler tools exist for that.
