nlmatics/nlm-ingestor
A document parsing service that provides RAG-friendly parsers for PDF, HTML, text, DOCX, and PPTX formats with layout awareness and OCR support.

The repository contains custom parsers optimized for retrieval-augmented generation (RAG) workflows. The PDF parser extracts text with coordinates and identifies sections, paragraphs, tables, and lists, with optional OCR for scanned pages. The HTML parser creates layout-aware blocks to improve RAG chunk quality, while the text parser infers structure from content alone. The parsers are exposed as a service that the llmsherpa API connects to, preparing documents for ingestion into LLM-powered applications.
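
To illustrate why layout-aware blocks help chunk quality, here is a minimal sketch of turning parsed blocks into chunks that carry their section heading path. The block shape (`tag`, `level`, `text`), the sample data, and the `blocks_to_chunks` helper are assumptions made for illustration; they are not the service's actual output schema or API.

```python
from typing import Dict, List, Tuple

# Hypothetical parsed blocks: each has a tag, a nesting level, and its text.
SAMPLE_BLOCKS: List[Dict] = [
    {"tag": "header", "level": 0, "text": "Installation"},
    {"tag": "para", "level": 1, "text": "Run the service in Docker."},
    {"tag": "header", "level": 1, "text": "Requirements"},
    {"tag": "list_item", "level": 2, "text": "Java for Tika"},
    {"tag": "header", "level": 0, "text": "Usage"},
    {"tag": "para", "level": 1, "text": "Point the client at the service URL."},
]


def blocks_to_chunks(blocks: List[Dict]) -> List[str]:
    """Prefix every non-header block with the headings it sits under."""
    chunks: List[str] = []
    header_stack: List[Tuple[int, str]] = []  # (level, heading text)
    for block in blocks:
        if block["tag"] == "header":
            # Drop headings at the same or a deeper level before pushing.
            while header_stack and header_stack[-1][0] >= block["level"]:
                header_stack.pop()
            header_stack.append((block["level"], block["text"]))
        else:
            path = " > ".join(text for _, text in header_stack)
            chunks.append(f"{path}\n{block['text']}" if path else block["text"])
    return chunks


if __name__ == "__main__":
    for chunk in blocks_to_chunks(SAMPLE_BLOCKS):
        print(chunk)
        print("---")
```

Each chunk keeps its heading context, so a retriever can match "Java for Tika" against a query about installation requirements even though the list item itself never mentions installation.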