ScrapeGraphAI/Scrapegraph-ai

Scraping by natural language, not XPath archaeology

A Python library that lets you point an LLM at a website and ask for what you want, instead of hand-crafting selectors.

26.9k stars · Python · RAG · Search · Data Tooling
Velocity (7d): +31 ★ / day · Trend: steady

What it does ScrapeGraphAI is a Python scraping library built around LLMs and “direct graph logic”: pipelines of nodes that fetch, parse, and extract data based on natural-language prompts. You tell it what you want from a page (or a local HTML, JSON, or Markdown file), and it returns structured output. It also handles multi-page search scraping, script generation, and text-to-speech output.
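To make the "pipeline of nodes" idea concrete, here is a minimal toy sketch in plain Python. It is not ScrapeGraphAI's implementation: the node names, the shared state dict, and the regex standing in for the LLM call are all assumptions chosen for illustration.

```python
import json
import re
from typing import Callable

# Each node takes the shared state dict and returns an updated copy.
Node = Callable[[dict], dict]


def fetch_node(state: dict) -> dict:
    # Stand-in for a real fetch: the "page" is already in state["source"].
    return {**state, "html": state["source"]}


def parse_node(state: dict) -> dict:
    # Strip tags and collapse whitespace so the extractor sees plain text.
    text = re.sub(r"<[^>]+>", " ", state["html"])
    return {**state, "text": " ".join(text.split())}


def extract_node(state: dict) -> dict:
    # Stand-in for the LLM call: a real node would send state["prompt"]
    # and state["text"] to a model and parse its structured answer.
    prices = re.findall(r"\$\d+(?:\.\d{2})?", state["text"])
    return {**state, "answer": {"prices": prices}}


def run_pipeline(nodes: list[Node], state: dict) -> dict:
    # Run the nodes in order, threading the state through each one.
    for node in nodes:
        state = node(state)
    return state


result = run_pipeline(
    [fetch_node, parse_node, extract_node],
    {
        "prompt": "List every price on the page",
        "source": "<ul><li>Tea $4.50</li><li>Coffee $3</li></ul>",
    },
)
print(json.dumps(result["answer"]))
```

Swapping the regex in `extract_node` for a model call is the step that makes the real thing prompt-driven. It is also where cost, latency, and nondeterminism enter.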

The interesting bit The “graph” abstraction is the hook: each scraping task is a directed pipeline (SmartScraperGraph, SearchGraph, etc.) where LLM calls can run in parallel in the “multi” variants. The library sits on top of Playwright for fetching and supports a wide range of LLM backends — from OpenAI and Groq to local Ollama models — without changing your pipeline code, just the config dict.
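A rough sketch of what "just the config dict" can look like in practice. The `SmartScraperGraph(prompt=..., source=..., config=...)` call and the `"llm"` config block follow the project's README as best recalled. The model names, the `make_config` and `scrape` helpers, and the API key placeholder are assumptions for illustration, so check them against the current docs.

```python
import copy

# Settings shared by every backend (keys as recalled from the README).
BASE_CONFIG = {"verbose": False, "headless": True}

# Only the "llm" block differs between backends. Model names are placeholders.
OPENAI_LLM = {"api_key": "YOUR_OPENAI_API_KEY", "model": "openai/gpt-4o-mini"}
OLLAMA_LLM = {"model": "ollama/llama3.2", "model_tokens": 8192}


def make_config(llm: dict) -> dict:
    """Return a fresh config dict with the given "llm" block plugged in."""
    config = copy.deepcopy(BASE_CONFIG)
    config["llm"] = dict(llm)
    return config


def scrape(prompt: str, source: str, config: dict):
    """Run one single-page extraction; this code does not care which backend is used."""
    # Imported lazily so the config helpers work without the library installed.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(prompt=prompt, source=source, config=config)
    return graph.run()


openai_config = make_config(OPENAI_LLM)
ollama_config = make_config(OLLAMA_LLM)
```

With the library, Playwright, and a backend available, `scrape("List the article titles", "https://example.com", ollama_config)` would run the same pipeline against a local model. Passing `openai_config` instead is the only change needed to switch providers.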

Key highlights

  • Prompt-driven extraction: no CSS selectors or regex required in the basic case
  • Six built-in pipeline types, from single-page scrapers to search-result aggregators and Python-script generators
  • Pluggable LLM backends: OpenAI, Azure, Gemini, Groq, MiniMax, or local models via Ollama
  • Parallel execution support via “multi” graph variants
  • Integrations with LangChain, LlamaIndex, Crew.ai, and low-code tools like Zapier, n8n, Bubble
  • Also available as a hosted API with Python and Node.js SDKs
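The parallel "multi" idea can be pictured as a fan-out over pages. The sketch below uses a thread pool and a hypothetical `extract` stand-in. It illustrates the pattern only and makes no claim about how the library schedules its LLM calls internally.

```python
from concurrent.futures import ThreadPoolExecutor


def extract(prompt: str, page: str) -> dict:
    # Hypothetical stand-in for one LLM-backed extraction over one page.
    return {"prompt": prompt, "words": len(page.split())}


def extract_many(prompt: str, pages: dict[str, str], max_workers: int = 4) -> dict[str, dict]:
    # Submit one task per page, then collect the results keyed by page name.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(extract, prompt, text) for name, text in pages.items()}
        return {name: future.result() for name, future in futures.items()}


pages = {"a.html": "alpha beta gamma", "b.html": "delta epsilon"}
results = extract_many("Count the words", pages)
print(results)
```

Because each call spends most of its time waiting on a model or the network, threads are a reasonable fit for this kind of fan-out.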

Caveats

  • Requires installing Playwright's browsers separately (playwright install) before it can fetch websites
  • Collects anonymous telemetry by default; you can opt out via an environment variable
  • The README’s “5 lines of code” pitch at the top links to a commercial hosted version, not the open-source library
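For the telemetry caveat, the opt-out can be set from Python before the library is imported. The variable name below is an assumption based on recollection of the project's README, so confirm it there before relying on it.

```python
import os

# Assumed variable name; confirm it against the project's README.
TELEMETRY_ENV_VAR = "SCRAPEGRAPHAI_TELEMETRY_ENABLED"

# Set the flag before importing the library so it is visible at import time.
os.environ[TELEMETRY_ENV_VAR] = "false"

print(f"{TELEMETRY_ENV_VAR}={os.environ[TELEMETRY_ENV_VAR]}")
```

Exporting the same variable in the shell or a container definition has the same effect and keeps the setting out of application code.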

Verdict Worth a look if you maintain brittle scrapers that break every time a site redesigns, or if you want non-technical team members to specify extraction tasks. Skip it if you need deterministic, low-latency scraping without LLM costs and failure modes.
