emcf/thepipe
A document scraping library that uses vision-language models to extract structured markdown, tables, and media from PDFs, URLs, and other complex sources.

Velocity (7d): +1.9 ★/day · Trend: steady
The package parses complex documents, including PDFs, Word documents, PowerPoint presentations, and web pages, using VLMs to produce clean markdown and structured data. It provides AI-native file-type detection, layout analysis, and multi-format support across documents, video, and audio. The library integrates with popular RAG frameworks and vector databases, and it works with any LLM or VLM provider.