
nlmatics/llmsherpa

PDF/document parser with layout awareness that extracts structured sections, paragraphs, tables and lists for LLM chunking pipelines.

1.7k stars · Jupyter Notebook · Data Tooling · RAG · Search
Velocity (7d): +1.8 ★/day · Trend: steady

LLM Sherpa provides APIs for parsing documents and chunking them using hierarchical layout information. Its LayoutPDFReader extracts sections, subsections, paragraphs, tables, and lists while removing headers, footers, and watermarks. This helps developers build text chunks suited to vectorization, and it eases context window limits by joining content that is split across pages. The backend service is now fully open source as nlm-ingestor.
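To make the idea of layout-aware chunking concrete, here is a minimal sketch in plain Python. It does not call the llmsherpa API and is not how LayoutPDFReader is implemented; the `Block` class and `chunks_with_context` function are hypothetical names invented for this illustration. It assumes a document has already been parsed into a tree of sections and paragraphs, and shows how each paragraph can carry the headings of its enclosing sections into the chunk that gets vectorized.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Block:
    """Hypothetical layout node: either a section heading or a paragraph."""
    text: str
    kind: str = "para"  # "section" or "para"
    children: List["Block"] = field(default_factory=list)


def chunks_with_context(node: Block, trail: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield each paragraph prefixed with the headings of its enclosing sections."""
    if node.kind == "section":
        # Extend the heading trail and descend into the section's children.
        trail = trail + (node.text,)
        for child in node.children:
            yield from chunks_with_context(child, trail)
    else:
        heading = " > ".join(trail)
        yield f"{heading}\n{node.text}" if heading else node.text


# A tiny hand-built document tree (illustrative data, not parser output).
doc = Block("Methods", "section", [
    Block("We describe the setup."),
    Block("Data", "section", [Block("The corpus has two parts.")]),
])

chunks = list(chunks_with_context(doc))
for chunk in chunks:
    print(chunk)
    print("---")
```

Prefixing each chunk with its heading trail is one simple way to keep a paragraph interpretable once it is separated from the rest of the document; a real pipeline would also need to handle tables, lists, and chunk size limits.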
