
paulpierre/markdown-crawler

A multithreaded web crawler that converts websites into markdown files for use in LLM RAG pipelines and knowledge bases.

445 stars · Python · Data Tooling · RAG · Search
Velocity (7d): +0.5 ★ / day · Trend: steady

This tool recursively crawls a website and writes one markdown file per page, preserving document structure such as tables and images. It uses BeautifulSoup for HTML parsing and supports multithreading for faster crawling, with resumable sessions. The output is designed to be easily chunked and processed for retrieval-augmented generation systems, LLM fine-tuning datasets, and agent knowledge bases.
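
As a rough illustration of what "easily chunked" can mean downstream, here is a minimal sketch that splits one crawled markdown file into heading-delimited chunks using only the Python standard library. It is not part of markdown-crawler: the names `chunk_markdown`, `HEADING`, and `FENCE` are hypothetical, and it assumes the markdown uses ATX-style (`#`) headings.

```python
import re

# Hypothetical helper names; none of this is markdown-crawler's own API.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")  # ATX headings: "# Title" .. "###### Title"
FENCE = "`" * 3                             # marker that opens or closes a fenced code block


def chunk_markdown(text):
    """Split markdown text into a list of {"heading", "body"} chunks.

    A new chunk starts at every ATX heading. Lines inside fenced code
    blocks are never treated as headings, so "# comment" lines in code
    stay with their chunk.
    """
    chunks = []
    title, body = "", []
    in_fence = False

    def flush():
        # Emit the current chunk only if it has a heading or non-blank text.
        if title or any(line.strip() for line in body):
            chunks.append({"heading": title, "body": "\n".join(body).strip()})

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "intro\n\n# A\ntext a\n\n## B\n" + FENCE + "\n# not a heading\n" + FENCE + "\n"
    for chunk in chunk_markdown(sample):
        print(repr(chunk["heading"]), len(chunk["body"]))
```

Each resulting chunk could then be embedded or indexed on its own. Text that appears before the first heading is kept as a chunk with an empty heading, so nothing from the page is dropped.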
