nlmatics/llmsherpa
PDF/document parser with layout awareness that extracts structured sections, paragraphs, tables and lists for LLM chunking pipelines.

Velocity (7d): +1.8 ★/day · Trend: steady
LLM Sherpa provides APIs for parsing and chunking documents while preserving their hierarchical layout. Its LayoutPDFReader extracts sections, subsections, paragraphs, tables, and lists, and removes headers, footers, and watermarks. This helps developers build well-formed text chunks for vectorization and eases context window limits, because content that spans page breaks is joined into a single chunk. The backend service is now fully open-sourced as nlm-ingestor.
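
To make the idea of layout-aware chunking concrete, here is a minimal sketch in plain Python. It is not llmsherpa's code or API. The `Block` type and the `chunk_by_section` function are hypothetical names invented for illustration, and the sketch assumes a parser has already labelled each block with its heading path, page number, and kind.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One parsed piece of a document (hypothetical type, for illustration only)."""
    section: tuple      # heading path, e.g. ("Methods", "Parsing")
    page: int           # page the block was found on
    text: str
    kind: str = "para"  # "para", "header", or "footer"


def chunk_by_section(blocks):
    """Group body text by heading path, ignoring page boundaries.

    Page headers and footers are dropped, and each chunk is prefixed with
    its heading path so it keeps its context when embedded on its own.
    """
    grouped = {}  # dicts preserve insertion order, so document order is kept
    for block in blocks:
        if block.kind in ("header", "footer"):
            continue
        grouped.setdefault(block.section, []).append(block.text)
    return [
        " > ".join(path) + "\n" + " ".join(texts)
        for path, texts in grouped.items()
    ]
```

A paragraph split across pages 1 and 2 under the same heading ends up in one chunk, because grouping keys on the heading path and not on the page number. The actual library exposes its own reader and chunk objects, so consult its documentation for the real interface.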