Scraping by natural language, not XPath archaeology
A Python library that lets you point an LLM at a website and ask for what you want, instead of hand-crafting selectors.

What it does
ScrapeGraphAI is a Python scraping library built around LLMs and “direct graph logic”: essentially, pipelines of nodes that fetch, parse, and extract data based on natural-language prompts. You tell it what you want from a page (or from a local HTML, JSON, or Markdown file), and it returns structured output. It also handles multi-page search scraping, script generation, and text-to-speech output.
The interesting bit
The “graph” abstraction is the hook: each scraping task is a directed pipeline (SmartScraperGraph, SearchGraph, etc.), and the “multi” variants can run their LLM calls in parallel. The library uses Playwright for fetching and supports a wide range of LLM backends, from OpenAI and Groq to local Ollama models. Switching backends means changing the config dict, not the pipeline code.
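To make the "only the config changes" point concrete, here is a minimal sketch. It assumes the `SmartScraperGraph(prompt=..., source=..., config=...)` constructor and `run()` method from the project's README; the helper names (`make_config`, `scrape`), the model identifiers, and the API-key placeholder are illustrative assumptions, not part of the library.

```python
from typing import Any

# Illustrative backend settings. The model identifiers and the key placeholder
# are assumptions for this sketch; check the project's docs for current names.
LLM_BACKENDS: dict[str, dict[str, Any]] = {
    "openai": {"model": "openai/gpt-4o-mini", "api_key": "YOUR_OPENAI_API_KEY"},
    "ollama": {"model": "ollama/llama3.2", "model_tokens": 8192},
}


def make_config(backend: str) -> dict[str, Any]:
    """Return a graph config for the named backend (hypothetical helper)."""
    if backend not in LLM_BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    # Copy the settings so callers cannot mutate the shared table.
    return {"llm": dict(LLM_BACKENDS[backend]), "verbose": False, "headless": True}


def scrape(prompt: str, source: str, backend: str = "ollama") -> Any:
    """Run a single-page extraction; only the config depends on the backend."""
    # Imported lazily so make_config works without the library installed.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(prompt=prompt, source=source, config=make_config(backend))
    return graph.run()
```

A call such as `scrape("List the article titles and authors", "https://example.com/blog", backend="openai")` would need the library installed, Playwright's browsers downloaded, and a valid API key. Passing `backend="ollama"` instead leaves the pipeline code untouched.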
Key highlights
- Prompt-driven extraction: no CSS selectors or regex required in the basic case
- Six built-in pipeline types, from single-page scrapers to search-result aggregators and Python-script generators
- Pluggable LLM backends: OpenAI, Azure, Gemini, Groq, MiniMax, or local models via Ollama
- Parallel execution support via “multi” graph variants
- Integrations with LangChain, LlamaIndex, Crew.ai, and low-code tools such as Zapier, n8n, and Bubble
- Also available as a hosted API with Python and Node.js SDKs
Caveats
- Requires a separate Playwright installation (playwright install) for website fetching
- Collects anonymous telemetry by default; opt out via an environment variable
- The README’s “5 lines of code” pitch at the top links to a commercial hosted version, not the open-source library
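The telemetry opt-out can be done from Python as well as from the shell. The sketch below assumes the variable is named `SCRAPEGRAPHAI_TELEMETRY_ENABLED` (the name used in the project's README, to the best of my knowledge) and that it is read when the library is imported; verify both against the version you install. The helper name is hypothetical.

```python
import os

# Assumed variable name; confirm it against the README of your installed version.
TELEMETRY_ENV_VAR = "SCRAPEGRAPHAI_TELEMETRY_ENABLED"


def disable_telemetry_via_env() -> None:
    """Set the opt-out flag for this process (hypothetical helper)."""
    os.environ[TELEMETRY_ENV_VAR] = "false"


# Call this before importing scrapegraphai, on the assumption that the
# setting is read at import time.
disable_telemetry_via_env()
```

Setting the variable in the shell or a deployment config has the same effect and avoids depending on import order.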
Verdict
Worth a look if you maintain brittle scrapers that break every time a site redesigns, or if you want non-technical team members to specify extraction tasks. Skip it if you need deterministic, low-latency scraping without LLM costs and failure modes.