A ghost in the machine, now exorcised
An experimental 2023 library that handed your web scraping to GPT—complete with cost controls, hallucination checks, and a $0.36-per-page price tag.

What it does
scrapeghost was a Python wrapper that outsourced web scraping to OpenAI’s GPT models. You defined a schema in plain Python objects, fed it HTML, and the LLM extracted structured data. The library handled the boring infrastructure: cleaning HTML, splitting oversized pages across multiple API calls, validating JSON, checking for hallucinations, and tracking token spend against a budget.
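To make the schema-then-validate idea concrete, here is a minimal sketch using only the standard library. It is not scrapeghost's code: the `SCHEMA` fields and the `validate_response` helper are hypothetical names invented for illustration, and the check only compares top-level keys.

```python
import json

# Hypothetical schema in the "plain Python objects" style: field name -> type hint.
SCHEMA = {"name": "string", "district": "string", "emails": ["string"]}


def validate_response(raw: str, schema: dict) -> dict:
    """Parse model output and check it has exactly the schema's top-level keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = schema.keys() - data.keys()
    extra = data.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return data
```

A real pipeline would also check value types and could hand the error message back to the model for another attempt, which is roughly what the self-correction step described in this review amounts to.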
The interesting bit
The author built guardrails for a fundamentally reckless idea. A “hallucination check” verified that the data GPT returned actually appeared on the original page, and you could set a hard cost ceiling, useful when a single GPT-4 call on a “moderately sized page” ran $0.36. The auto-splitting feature was genuinely clever: it chunked large HTML across multiple calls rather than truncating and hoping.
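Both guardrails are simple enough to sketch. The code below is an illustration under stated assumptions, not the library's implementation: `split_chunks` and `unsupported_values` are hypothetical helpers, chunk size is measured in characters as a stand-in for tokens, and the hallucination check is a plain case-insensitive substring match on string fields.

```python
def split_chunks(fragments: list[str], max_chars: int) -> list[str]:
    """Greedily pack HTML fragments into chunks of at most max_chars.

    A single fragment longer than max_chars becomes its own oversized chunk
    rather than being cut in the middle of an element.
    """
    chunks, current = [], ""
    for frag in fragments:
        # Start a new chunk when adding this fragment would overflow the limit.
        if current and len(current) + len(frag) > max_chars:
            chunks.append(current)
            current = ""
        current += frag
    if current:
        chunks.append(current)
    return chunks


def unsupported_values(record: dict, page_text: str) -> list[str]:
    """Return the string values in record that never appear in page_text."""
    haystack = page_text.casefold()
    return [
        value
        for value in record.values()
        if isinstance(value, str) and value.casefold() not in haystack
    ]
```

Each chunk would go to the model as a separate call, with the results merged afterward. Any value returned by `unsupported_values` is a candidate hallucination, though a substring match will also flag legitimate values the model has reformatted, such as a normalized date.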
Key highlights
- Python-native schema definition—no JSON Schema wrestling
- Preprocessing pipeline: HTML cleaning, CSS/XPath pre-filtering, auto-splitting for large pages
- Postprocessing: JSON validation with GPT self-correction, Pydantic schema enforcement, hallucination detection
- Cost tracking with running token totals and budget hard-stops
- Automatic model fallback: try GPT-3.5-Turbo first and fall back to GPT-4 only when needed, trading cost against quality
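The last two highlights, budget hard-stops and model fallback, fit together naturally. The sketch below is one way to combine them, not the library's actual API: `CostTracker`, `BudgetExceeded`, `scrape_with_fallback`, the model names, and the per-token rates are all hypothetical, and `call_model` is a caller-supplied function standing in for the real API request.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when recorded spend passes the configured ceiling."""


@dataclass
class CostTracker:
    max_cost: float  # hard ceiling in dollars
    # Dollars per 1,000 tokens as (prompt, completion); illustrative numbers only.
    rates: dict = field(default_factory=lambda: {
        "cheap-model": (0.002, 0.002),
        "strong-model": (0.03, 0.06),
    })
    total_cost: float = 0.0

    def record(self, model: str, prompt_tokens: int, completion_tokens: int) -> None:
        """Add the cost of one call, then stop hard if the budget is blown."""
        prompt_rate, completion_rate = self.rates[model]
        self.total_cost += (
            prompt_tokens * prompt_rate + completion_tokens * completion_rate
        ) / 1000
        if self.total_cost > self.max_cost:
            raise BudgetExceeded(
                f"spent ${self.total_cost:.4f} of ${self.max_cost:.4f}"
            )


def scrape_with_fallback(html, models, call_model, tracker):
    """Try each model in order and return the first usable result.

    call_model(model, html) must return (data_or_None, prompt_tokens,
    completion_tokens); None means the output failed validation.
    """
    for model in models:
        data, prompt_tokens, completion_tokens = call_model(model, html)
        # Failed attempts still cost money, so record before checking the result.
        tracker.record(model, prompt_tokens, completion_tokens)
        if data is not None:
            return data
    raise RuntimeError("no model produced a valid result")
```

Recording spend before inspecting the result matters: a failed cheap attempt followed by an expensive retry is exactly the case where a running total earns its keep.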
Caveats
- Explicitly unmaintained: per the README, the author “has no interest in working with commercial LLMs”
- Expensive at scale: ~$0.36 per moderate page on GPT-4, with pricing estimates explicitly “not guaranteed to be accurate”
- Source code moved to Codeberg; GitHub repo is archival
Verdict
Worth studying if you’re building LLM scraping tools today and want to learn from prior guardrail design. Not worth installing unless you enjoy maintaining abandoned dependencies. The author’s disavowal is the most honest README note you’ll read this week.