Is thepipe open source?

Yes — emcf/thepipe is open source, released under the MIT license.

What language is thepipe written in?

emcf/thepipe is primarily written in Python.

How popular is thepipe?

emcf/thepipe has 1.5k stars on GitHub.

Where can I find thepipe?

emcf/thepipe is on GitHub at https://github.com/emcf/thepipe.

← all repositories

emcf/thepipe

A Document Scraper That Actually Looks at the Page

thepipe extracts clean, structured data from messy documents by feeding them to vision-language models instead of relying on brittle text parsers.

★1.5k stars Python Data Tooling RAG · Search

View on GitHub ↗ Homepage ↗

Not currently ranked — collecting fresh signals.

star history

What it does

thepipe ingests a wide range of sources—PDFs, Word documents, PowerPoint slides, web pages, Python notebooks, video, audio, and even GitHub repositories—and extracts clean markdown, tables, images, and text. It combines computer-vision models with heuristics, and can escalate tricky or scanned documents to a vision-language model for better accuracy. The results are packaged into chunks ready for LLMs, vector databases, or RAG frameworks.

The interesting bit

Instead of treating every file as a bag of text, thepipe uses VLMs to reason about layout and visuals, which is why it can handle scanned PDFs and complex slides that traditional scrapers mangle. It also decouples scraping from chunking, so you can re-split the same extracted content by page, section, semantic shift, or an LLM agent without re-processing the source file.

Key highlights

Multimodal extraction from PDFs, Office docs, web pages, video, audio, and GitHub repos
Pluggable VLM backend: works with OpenAI by default, or any OpenAI-compatible client such as OpenRouter or a local server
Seven chunking strategies, including experimental semantic splitting and an LLM-based agentic splitter
Lightweight base install suitable for CI; heavier dependencies like PyTorch and Whisper are optional extras
Native export to OpenAI chat message format or LlamaIndex Document/ImageDocument objects

Caveats

The structured extraction feature is deprecated and will be removed in future releases
Semantic and agentic chunking are marked experimental
Full support for media-rich sources requires additional system dependencies such as ffmpeg and Playwright

Verdict

Worth a look if you are building RAG or document-QA pipelines and tired of parsers that choke on scanned PDFs or mixed-layout slides. Skip it if you only need simple text extraction from clean, predictable sources.

Frequently asked

What is emcf/thepipe?: thepipe extracts clean, structured data from messy documents by feeding them to vision-language models instead of relying on brittle text parsers.
Is thepipe open source?: Yes — emcf/thepipe is open source, released under the MIT license.
What language is thepipe written in?: emcf/thepipe is primarily written in Python.
How popular is thepipe?: emcf/thepipe has 1.5k stars on GitHub.
Where can I find thepipe?: emcf/thepipe is on GitHub at https://github.com/emcf/thepipe.