nlmatics/nlm-ingestor

A document parsing service that provides RAG-friendly parsers for PDF, HTML, text, DOCX, and PPTX formats with layout awareness and OCR support.

1.3k stars · Python · RAG · Search · Data Tooling
Velocity (7d): +1.5 ★ / day · Trend: steady

The repository contains custom parsers optimized for retrieval-augmented generation (RAG) workflows. The PDF parser extracts text with coordinates, sections, paragraphs, tables, and lists, with optional OCR for scanned pages. The HTML parser creates layout-aware blocks for better RAG chunk quality, while the text parser infers structure from content alone. All parsers feed into the llmsherpa API to prepare documents for ingestion into LLM-powered applications.
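
To give a feel for what "inferring structure from content alone" can mean for plain text, here is a minimal sketch in Python. It is not nlm-ingestor's implementation: the `Block` type, the `infer_blocks` function, and the heading and list heuristics are all hypothetical, chosen only to illustrate how lines might be grouped into headers, list items, and paragraphs before chunking.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # "header", "list_item", or "para" (hypothetical labels)
    text: str


# Assumed heuristic: a bullet character or "1." / "1)" starts a list item.
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+(.*)$")


def looks_like_header(line: str) -> bool:
    """Assumed heuristic: short, no trailing punctuation, capitalized words."""
    words = line.split()
    if not words or len(words) > 8 or line[-1] in ".,;:!?":
        return False
    if not any(c.isalpha() for c in line):
        return False
    return line.isupper() or all(w[0].isupper() for w in words if w[0].isalpha())


def infer_blocks(text: str) -> list[Block]:
    """Group raw lines into header, list-item, and paragraph blocks."""
    blocks: list[Block] = []
    para: list[str] = []

    def flush() -> None:
        # Emit the paragraph collected so far, if any.
        if para:
            blocks.append(Block("para", " ".join(para)))
            para.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            flush()  # a blank line ends the current paragraph
            continue
        match = LIST_ITEM.match(line)
        if match:
            flush()
            blocks.append(Block("list_item", match.group(1)))
        elif not para and looks_like_header(line):
            blocks.append(Block("header", line))
        else:
            para.append(line)
    flush()
    return blocks


if __name__ == "__main__":
    sample = "Overview\n\nThis parser reads text.\nIt keeps lines together.\n\n- first item\n2. second item\n"
    for block in infer_blocks(sample):
        print(block.kind, "|", block.text)
```

Keeping each block's kind alongside its text is what lets a downstream chunker avoid splitting a list or separating a heading from the paragraph it introduces. A real parser would need far more robust heuristics than these.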
