
emcf/thepipe

A document scraping library that uses vision-language models to extract structured markdown, tables, and media from PDFs, URLs, and other complex sources.

1.5k stars · Python · Data Tooling · RAG · Search
Velocity (7d): +1.9 ★/day · Trend: steady

The package parses complex documents, including PDFs, Word documents, PowerPoint presentations, and web pages, using VLMs to produce clean markdown and structured data. It provides AI-native file-type detection, layout analysis, and multi-format support across documents, video, and audio. The library integrates with popular RAG frameworks and vector databases, and works with any LLM or VLM provider.
