explosion/spacy-layout

A spaCy plugin that converts PDFs and Word documents into structured data and spaCy Doc objects for downstream NLP and RAG processing.

903 stars · Python · Data Tooling · RAG · Search
Velocity (7d): +1.6 ★/day · Trend: steady

This plugin integrates with Docling to extract structured data from PDFs, Word documents, and other formats. It creates spaCy Doc objects with labelled text spans (sections, headings) and tables converted to pandas DataFrames. The resulting structured output enables linguistic analysis, named entity recognition, text classification, and chunking for RAG pipelines.
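To make the chunking step concrete, here is a minimal sketch of grouping labelled spans under their nearest preceding heading, which is one common way to build RAG chunks. It deliberately does not call the plugin: plain `(label, text)` tuples stand in for the labelled spans, and the `Chunk` class, the `chunk_by_heading` helper, and the label names `title`, `section_header`, and `text` are assumptions made for illustration, not the plugin's confirmed API.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class Chunk:
    """Hypothetical container: one heading plus the body text beneath it."""
    heading: Optional[str]
    texts: List[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Join the body spans into a single string for embedding or indexing.
        return "\n".join(self.texts)


def chunk_by_heading(
    spans: Iterable[Tuple[str, str]],
    heading_labels: Tuple[str, ...] = ("title", "section_header"),  # assumed label names
) -> List[Chunk]:
    """Group (label, text) pairs into chunks that start at each heading."""
    chunks: List[Chunk] = []
    current = Chunk(heading=None)  # collects any text that precedes the first heading
    for label, text in spans:
        if label in heading_labels:
            # A new heading closes the current chunk; chunks with no body are dropped.
            if current.texts:
                chunks.append(current)
            current = Chunk(heading=text)
        else:
            current.texts.append(text)
    if current.texts:
        chunks.append(current)
    return chunks


# Illustrative input only; real spans would come from the parsed document.
sample_spans = [
    ("title", "Annual Report"),
    ("text", "This report covers 2023."),
    ("section_header", "Revenue"),
    ("text", "Revenue grew."),
    ("text", "Costs fell."),
    ("section_header", "Outlook"),
    ("text", "Stable."),
]

for chunk in chunk_by_heading(sample_spans):
    print(chunk.heading, "->", repr(chunk.text))
```

Keeping the heading alongside each chunk means it can be stored as retrieval metadata or prepended to the chunk text before embedding. With real output from the plugin, the tuples would be built from each span's label and text rather than written by hand.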
