unclecode/crawl4ai

A web crawler born from turbo anger mode

The most-starred crawler on GitHub exists because its creator refused to pay $16 for a bad API.

68k stars · Python · Data Tooling · Other AI
Velocity (7d): +90 ★/day · Trend: steady

What it does Crawl4AI turns messy web pages into clean, structured Markdown built specifically for LLMs. It runs async browser pools, handles JavaScript-heavy sites, and spits out RAG-ready output with citations, tables, and code blocks intact. There’s also a CLI (crwl) and a Dockerized FastAPI server if you want to self-host.
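
For a feel of the basic flow, here is a minimal sketch modeled on the project's quickstart. It assumes `crawl4ai` is installed and its browser setup has been run; the helper name `crawl_to_markdown` is ours, not part of the library, and result attributes may differ between versions.

```python
import asyncio


async def crawl_to_markdown(url: str) -> str:
    """Fetch one page with Crawl4AI and return its Markdown (illustrative helper)."""
    # Imported lazily so this file still loads where crawl4ai is not installed.
    from crawl4ai import AsyncWebCrawler

    # The context manager starts a browser session and closes it on exit.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        if not result.success:
            raise RuntimeError(f"crawl failed: {result.error_message}")
        return str(result.markdown)


# Example call (needs network access and a working browser install):
# print(asyncio.run(crawl_to_markdown("https://example.com")))
```

No API key or account appears anywhere in that flow, which is the point of the project.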

The interesting bit The author built this in days after an “open source” competitor demanded an account, API token, and $16 for subpar results. That origin story shows in the design: zero API keys required, full browser control via Playwright, and hooks at every step for custom behavior. The v0.8.7 release is a security-hardening patch that fixed critical Docker API vulnerabilities including RCE, SSRF, and a hardcoded JWT secret—if you self-host, upgrade immediately.

Key highlights

  • LLM-native output: BM25-based noise filtering, chunking strategies, and cosine similarity matching for semantic extraction
  • Browser fingerprinting control: persistent profiles, proxy support, stealth mode, and CDP remote control for dodging bot detection
  • Resumable deep crawls: resume_state and on_state_change callbacks for long jobs that crash, plus a prefetch=True mode claiming 5-10x faster URL discovery
  • Dual extraction modes: fast CSS/XPath schema extraction or LLM-driven structured JSON for complex patterns
  • Supply-chain paranoia: v0.8.6 replaced litellm with unclecode-litellm after a PyPI compromise
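
To make the BM25 noise-filtering idea from the first highlight concrete, here is a rough standalone sketch in plain Python. It is not Crawl4AI's implementation: the function names, the sample chunks, the `k1`/`b` defaults, and the 0.5 threshold are all assumptions chosen for illustration. The idea is to score each text chunk against a query and drop the chunks that score too low.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(chunks, query, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per chunk for the given query."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0

    # Document frequency: in how many chunks does each term appear?
    df = Counter()
    for d in docs:
        df.update(set(d))

    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long chunks get less credit per hit.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def filter_chunks(chunks, query, threshold=0.5):
    """Keep only chunks whose BM25 score exceeds the threshold."""
    scores = bm25_scores(chunks, query)
    return [c for c, s in zip(chunks, scores) if s > threshold]


chunks = [
    "Home | About | Login | Subscribe to our newsletter",
    "Crawl4AI converts web pages into Markdown for LLM pipelines",
    "Cookie settings: accept all cookies",
    "The crawler renders JavaScript pages and emits Markdown with tables",
]
scores = bm25_scores(chunks, "crawler markdown pages")
kept = filter_chunks(chunks, "crawler markdown pages")
print(kept)
```

The navigation and cookie lines share no terms with the query, so they score zero and fall out; the two content lines survive.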

Caveats

  • The Docker API has had multiple critical security issues; self-hosting requires vigilance
  • “Adaptive intelligence” that “learns site patterns” is mentioned but not explained in detail—unclear how it works
  • Sponsorship tiers up to $2,000/month suggest the project is actively monetizing; the “free” positioning may shift

Verdict Grab this if you’re building RAG pipelines or AI agents and need clean web data without vendor lock-in. Skip it if you want a managed, zero-maintenance solution—the self-hosted Docker path has real security baggage.
