0xMassi/webclaw

A Rust scraper that speaks MCP and minds its own business

webclaw turns noisy HTML into clean markdown, JSON, or LLM-ready text—locally, without calling home.

1.3k stars · Rust · RAG · Search · Data Tooling
Velocity (7d): +15 ★/day · Trend: steady

What it does

webclaw fetches web pages and strips out the junk: nav bars, ads, scripts, duplicated boilerplate. It returns structured output in five formats including markdown, LLM-optimized text, and JSON. You can scrape one page, crawl a docs site, diff two snapshots, or extract brand assets like colors and logos. It runs as a CLI, an MCP server for Claude/Cursor/etc., or a self-hosted REST API. There’s also a hosted tier at webclaw.io for JS rendering, web search, and async jobs.

The interesting bit

The architecture splits extraction from fetching: webclaw-core has zero network I/O and can be used standalone. That’s unusual in a space where most tools conflate “get the page” with “make it readable.” The MCP server is first-class, not an afterthought—npx create-webclaw auto-detects your AI client and wires it up.
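The fetch/extract split described above can be illustrated with a minimal sketch. This is not webclaw-core's actual API: `extract_text`, `Fetch`, `InMemoryPage`, and `scrape` are hypothetical names invented for illustration, and the tag stripping is deliberately naive (lowercase tags only, no real HTML parsing). The point is only that the extraction function takes a string and never touches the network.

```rust
/// Hypothetical pure extraction step: takes HTML already in memory and
/// returns the visible text. No network I/O happens in this function.
fn extract_text(html: &str) -> String {
    let mut out = String::new();
    let mut rest = html;
    while let Some(open) = rest.find('<') {
        out.push_str(&rest[..open]);
        let after = &rest[open..];
        // Skip whole elements whose content is noise for a reader.
        // Naive assumption: tags are lowercase and not nested in themselves.
        let skipped = ["script", "style", "nav"].iter().find_map(|tag| {
            let open_tag = format!("<{tag}");
            let close_tag = format!("</{tag}>");
            if after.starts_with(&open_tag) {
                after.find(&close_tag).map(|end| end + close_tag.len())
            } else {
                None
            }
        });
        let consumed = match skipped {
            Some(n) => n,
            None => match after.find('>') {
                Some(close) => close + 1,
                None => after.len(), // unterminated tag: drop the remainder
            },
        };
        out.push(' '); // keep words on either side of a tag separated
        rest = &after[consumed..];
    }
    out.push_str(rest);
    // Collapse runs of whitespace into single spaces.
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hypothetical fetching abstraction, kept separate from extraction.
trait Fetch {
    fn fetch(&self, url: &str) -> Result<String, String>;
}

/// A stand-in fetcher that returns a stored page instead of using the network.
struct InMemoryPage(String);

impl Fetch for InMemoryPage {
    fn fetch(&self, _url: &str) -> Result<String, String> {
        Ok(self.0.clone())
    }
}

/// Composition happens at the edge: fetch first, then hand the HTML to the pure core.
fn scrape(fetcher: &impl Fetch, url: &str) -> Result<String, String> {
    fetcher.fetch(url).map(|html| extract_text(&html))
}
```

With this shape, the extraction step can be tested or embedded against saved HTML, and a real HTTP client only needs to implement the one-method trait.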

Key highlights

  • Core extraction runs locally without an account; hosted API is opt-in
  • First-class MCP server with one-command setup for Claude, Cursor, Windsurf, etc.
  • Output formats: markdown, LLM-optimized text, plain text, JSON, cleaned HTML
  • Built-in tools: scrape, crawl, map, batch, diff, brand extraction, plus LLM-powered extract/summarize
  • SDKs for TypeScript, Python, Go; Firecrawl-compatible API example included
  • Proxy support via env vars for dodging rate limits

Caveats

  • JavaScript rendering and web search are hosted-only; local mode sees only the static HTML the server returns
  • Building from source needs a native toolchain (cmake, clang, ssl-dev); prebuilt binaries are macOS/Linux only
  • Some features (research, watches) require the paid API; the README is upfront about this split

Verdict

Worth a look if you’re building RAG pipelines or giving agents web access and want extraction to happen on your hardware. Skip it if you need to scrape JavaScript-heavy SPAs and refuse to touch the hosted tier.
