watercrawl/WaterCrawl

A web crawler that extracts and transforms web content into markdown format optimized for LLM consumption.

1.8k stars · TypeScript · Data Tooling
Velocity (7d): +3.4 ★ / day · Trend: steady

WaterCrawl is a web scraping and crawling application built with Python, Django, Scrapy, and Celery that extracts content from websites and converts HTML into markdown. It specifically targets AI/LLM use cases by preparing web data in formats suitable for model training or RAG pipelines. The tool supports structured data extraction and includes Docker deployment options.
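To make the HTML-to-markdown step concrete, here is a minimal sketch using only the Python standard library. It is not WaterCrawl's code: the `MarkdownSketch` class and `html_to_markdown` function are hypothetical names invented for this illustration, and the sketch handles only headings, paragraphs, list items, and links while dropping script and style content.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Toy HTML-to-markdown converter (hypothetical, not WaterCrawl's code)."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []      # markdown fragments in document order
        self.href = None     # target of the <a> currently open, if any
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse whitespace runs to one space, as a browser would.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    """Convert an HTML string to markdown using the toy parser above."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).splitlines()]
    # Squeeze the blank lines left behind by adjacent block elements.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


page = "<h1>Title</h1><p>See <a href='https://example.com'>the docs</a>.</p>"
print(html_to_markdown(page))
```

A production crawler would also need to deal with tables, images, nested lists, and boilerplate such as navigation and footers, which this sketch ignores.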
