any4ai/AnyCrawl

A scraper that speaks fluent LLM

Node.js crawler that turns raw web pages into structured JSON via an LLM layer, with SERP support and a self-hosted API.

3.2k stars · MDX · Data Tooling · RAG · Search
Velocity (7d): +7.3 ★/day · Trend: steady

What it does

AnyCrawl is a self-hostable Node.js/TypeScript crawling toolkit with three main jobs: scraping single pages, crawling entire sites, and extracting structured search results from Google (Bing and Baidu are promised but not yet listed as supported). It exposes everything through a REST API and can hand page content to an LLM for structured JSON extraction.

The interesting bit

The LLM extraction layer is the hook. Instead of just dumping HTML or markdown, you pass a JSON schema and the tool asks an LLM to fill it in: company mission, employee count, boolean flags, whatever you define. It also supports Atlas Cloud as an OpenAI-compatible provider out of the box, which feels like a sponsorship integration dressed up as a feature.
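To make the schema-driven idea concrete, here is a minimal sketch of what "pass a JSON schema, get it filled in" could look like on the consuming side. The `ExtractionSchema` shape, the `companySchema` fields, and the `validateExtraction` helper are all hypothetical names invented for illustration; they are not AnyCrawl's documented API, and the check below is a simplified stand-in for full JSON Schema validation.

```typescript
// Hypothetical shapes for illustration only; NOT AnyCrawl's documented API.
type FieldType = "string" | "number" | "boolean";

interface ExtractionSchema {
  type: "object";
  properties: Record<string, { type: FieldType; description?: string }>;
  required: string[];
}

// The kind of schema the review describes: mission, employee count, a boolean flag.
const companySchema: ExtractionSchema = {
  type: "object",
  properties: {
    mission: { type: "string", description: "Company mission statement" },
    employeeCount: { type: "number" },
    isOpenSource: { type: "boolean" },
  },
  required: ["mission", "isOpenSource"],
};

// Checks an LLM's raw JSON reply against the schema.
// Returns a list of problems; an empty list means the reply conforms.
function validateExtraction(schema: ExtractionSchema, raw: string): string[] {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return ["response is not valid JSON"];
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return ["response is not a JSON object"];
  }

  const record = data as Record<string, unknown>;
  const problems: string[] = [];

  // Every required field must be present.
  for (const key of schema.required) {
    if (!(key in record)) problems.push(`missing required field: ${key}`);
  }
  // Every field that is present must have the declared primitive type.
  for (const key of Object.keys(schema.properties)) {
    const expected = schema.properties[key].type;
    if (key in record && typeof record[key] !== expected) {
      problems.push(`field ${key} should be ${expected}`);
    }
  }
  return problems;
}

const goodReply = '{"mission":"Index the web","employeeCount":12,"isOpenSource":true}';
const problemsFound = validateExtraction(companySchema, goodReply); // empty array
```

Validating the reply yourself is worth the few extra lines: LLM output can drift from the schema, and the review notes that visibility into the extraction prompts is limited.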

Key highlights

  • Three engines: Cheerio for static HTML, Playwright and Puppeteer for JS-rendered pages
  • Site crawling with depth limits, domain scoping, and path include/exclude rules
  • Built-in proxy support plus a default proxy (details vague; “high-quality” is the README’s word)
  • Redis-backed caching with S3 support for self-hosted deployments
  • Multi-threading and multi-process batch processing
  • MIT licensed
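The crawl-scoping bullet (depth limits, domain scoping, path include/exclude) boils down to one filter applied to every discovered link. A rough sketch of that filter follows, assuming a hypothetical `CrawlRules` shape and `shouldVisit` function; AnyCrawl's real option names and matching semantics may well differ.

```typescript
// Hypothetical rule shape for illustration; not taken from AnyCrawl's docs.
interface CrawlRules {
  seed: string;           // starting URL; its hostname defines the crawl scope
  maxDepth: number;       // links found deeper than this are skipped
  includePaths: RegExp[]; // if non-empty, the path must match at least one
  excludePaths: RegExp[]; // a path matching any of these is skipped
}

// Decides whether a discovered link should be queued for crawling.
// Patterns should not use the `g` flag, since RegExp.test is stateful with it.
function shouldVisit(link: string, depth: number, rules: CrawlRules): boolean {
  // 1. Depth limit.
  if (depth > rules.maxDepth) return false;

  // 2. Resolve relative links against the seed; drop anything unparseable.
  let target: URL;
  try {
    target = new URL(link, rules.seed);
  } catch {
    return false;
  }

  // 3. Domain scoping: stay on the seed's hostname.
  if (target.hostname !== new URL(rules.seed).hostname) return false;

  // 4. Path rules: exclusions win, then inclusions (if any) must match.
  const path = target.pathname;
  if (rules.excludePaths.some((re) => re.test(path))) return false;
  if (rules.includePaths.length > 0 && !rules.includePaths.some((re) => re.test(path))) {
    return false;
  }
  return true;
}

const docsRules: CrawlRules = {
  seed: "https://example.com/docs/",
  maxDepth: 2,
  includePaths: [/^\/docs\//],
  excludePaths: [/\/private\//],
};

const visitIntro = shouldVisit("/docs/intro", 1, docsRules);                 // true
const visitOther = shouldVisit("https://other.com/docs/intro", 1, docsRules); // false
```

Letting exclusions take priority over inclusions is a design choice in this sketch, not something the README specifies.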

Caveats

  • “Multiple search engines” is overstated: only Google is listed under supported engines, despite the marketing claim
  • The README is heavy on badges and sponsor banners; actual technical depth lives in external docs
  • No benchmarks, rate-limiting details, or cost estimates for the LLM extraction path

Verdict

Worth a look if you’re building RAG pipelines or AI agents and want a single self-hosted box that can crawl, cache, and structure data via an LLM. Skip it if you need reliable multi-engine SERP today or want deep visibility into how the extraction prompts actually behave.
