Is llm_aided_ocr open source?

Yes — Dicklesworthstone/llm_aided_ocr is an open-source project tracked on heatdrop.

What language is llm_aided_ocr written in?

Dicklesworthstone/llm_aided_ocr is primarily written in Python.

How popular is llm_aided_ocr?

Dicklesworthstone/llm_aided_ocr has 2.9k stars on GitHub.

Where can I find llm_aided_ocr?

Dicklesworthstone/llm_aided_ocr is on GitHub at https://github.com/Dicklesworthstone/llm_aided_ocr.


Dicklesworthstone/llm_aided_ocr

Tesseract’s copy editor is a large language model

Tesseract often mangles scanned text, so this pipeline feeds the raw OCR to an LLM for error correction, optional markdown formatting, and a quality score.

★2.9k stars Python Domain Apps Data Tooling



What it does

This Python script ingests scanned PDFs, converts each page to an image, and extracts text with Tesseract. It then slices the raw OCR output into overlapping chunks that break at sentence boundaries and feeds them to an LLM (either a local GGUF model via llama_cpp or a cloud provider such as OpenAI or Anthropic) to correct errors, optionally reformat the text as markdown, and strip duplicate paragraphs or headers. Finally, it runs a quality-assessment pass in which the LLM compares the cleaned text against the raw OCR output and assigns a score.
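The chunking step is the easiest part to picture in code. What follows is a minimal sketch, not the project's implementation: the function names are hypothetical, the sentence splitter is a naive regex, and a character budget stands in for the token budget the real script works with.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text: str, max_chars: int = 2000,
                    overlap_sentences: int = 1) -> list[str]:
    """Group whole sentences into chunks of roughly max_chars characters.

    max_chars is a soft limit: a chunk can exceed it when the carried-over
    overlap plus one new sentence is already too long. The last
    overlap_sentences sentences of a chunk are repeated at the start of
    the next one so the model sees some surrounding context.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Start the next chunk with the tail of the previous one.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a tiny budget the overlap is easy to see: chunk_sentences("One. Two. Three. Four.", max_chars=10) yields three chunks, each sharing one sentence with its neighbor.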

The interesting bit

The project is mostly well-engineered glue: pdf2image, pytesseract, and an LLM client held together by token-budget arithmetic (TOKEN_BUFFER, TOKEN_CUSHION) and careful chunk overlap. That is exactly what makes it useful. It treats the LLM as a proofreader rather than a search engine, which helps keep hallucinations tied to the text actually on the page.
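To give a feel for what token-budget arithmetic means here, this is a rough sketch under loud assumptions. The constant names TOKEN_BUFFER and TOKEN_CUSHION come from the project, but the values, the way they are combined, and both function names below are guesses for illustration only. The four-characters-per-token estimate is a common rule of thumb for English text, not a real tokenizer.

```python
# Assumed values: the page names these constants but not their sizes or exact roles.
TOKEN_BUFFER = 500    # assumed headroom reserved for the prompt template
TOKEN_CUSHION = 300   # assumed extra safety margin against estimation error


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def max_output_tokens(prompt: str, context_window: int) -> int:
    """Tokens left for the model's reply after the prompt and safety margins."""
    remaining = (context_window - estimate_tokens(prompt)
                 - TOKEN_BUFFER - TOKEN_CUSHION)
    return max(0, remaining)
```

Under these assumptions, a 400-character prompt in a 2048-token context leaves 2048 - 100 - 500 - 300 = 1148 tokens for the reply. A result of zero is the signal that the chunk needs to be smaller.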

Key highlights

Supports both local LLMs (via llama_cpp) and cloud APIs (OpenAI, Anthropic), switchable through environment variables.
Chunks text at sentence boundaries with overlap so the LLM retains context without exceeding token limits.
Two-step pass: first correct OCR errors, then optionally convert to markdown and deduplicate repeated paragraphs.
Includes a self-check function that compares the final output against the raw OCR and returns an LLM-generated quality score.
Can run fully offline with a local model and GPU acceleration, though the README warns that large documents will still devour time and compute.
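The highlights above can be sketched as a small skeleton. Treat every name here as hypothetical: the environment variable names (USE_LOCAL_LLM, API_PROVIDER), the prompts, and the functions are assumptions made for illustration, and the LLM is just a callable you pass in. Deduplication here is a plain exact-match filter swapped in to keep the example self-contained, whereas the project is described as handing that job to the LLM.

```python
import os
from typing import Callable, Mapping, Optional

LLM = Callable[[str], str]  # prompt in, completion out


def pick_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Choose a backend from environment variables (names are assumptions)."""
    env = os.environ if env is None else env
    if env.get("USE_LOCAL_LLM", "false").lower() == "true":
        return "local"
    return env.get("API_PROVIDER", "openai").lower()


def dedupe_paragraphs(text: str) -> str:
    """Drop paragraphs that repeat an earlier one, ignoring case and spacing."""
    seen: set[str] = set()
    kept: list[str] = []
    for para in text.split("\n\n"):
        key = " ".join(para.split()).lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(para.strip())
    return "\n\n".join(kept)


def correct_document(chunks: list[str], llm: LLM,
                     to_markdown: bool = True) -> str:
    """Step 1: fix OCR errors per chunk. Step 2 (optional): markdown pass."""
    outputs: list[str] = []
    for chunk in chunks:
        fixed = llm("Fix OCR errors without adding new content:\n\n" + chunk)
        if to_markdown:
            fixed = llm("Reformat as markdown, keeping the wording:\n\n" + fixed)
        outputs.append(fixed.strip())
    return dedupe_paragraphs("\n\n".join(outputs))
```

Passing the LLM in as a callable keeps the pipeline testable without a model: a stub that echoes the text after the prompt is enough to exercise the control flow.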

Caveats

The README admits the final quality is entirely hostage to the LLM you choose; a cheap or small model will likely propagate existing errors or invent new ones.
Processing large documents is explicitly flagged as slow and resource-hungry, so this is not a low-latency pipeline.
You currently have to edit a variable inside the main() function to point at your PDF, which suggests the interface is still rough.
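If editing main() gets old, a thin argparse wrapper is one way around it. This is a hypothetical sketch, not something the project ships: the argument names and the parse_args function are invented here, and wiring the parsed path into the script's own main() is left to you.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse a PDF path and a markdown toggle from the command line."""
    parser = argparse.ArgumentParser(
        description="Hypothetical CLI wrapper for an OCR-correction script")
    parser.add_argument("pdf", type=Path, help="path to the scanned PDF")
    parser.add_argument("--no-markdown", action="store_true",
                        help="skip the markdown reformatting step")
    return parser.parse_args(argv)
```

Calling parse_args(["scan.pdf", "--no-markdown"]) returns a namespace whose pdf attribute is a Path and whose no_markdown attribute is True.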

Verdict

Worth a look if you regularly digitize scanned books or archival PDFs and want cleaner text than Tesseract delivers out of the box. Skip it if you need a polished CLI tool or API service; this is a script you modify, not a product you deploy.
