jamesturk/scrapeghost

A ghost in the machine, now exorcised

An experimental 2023 library that handed your web scraping to GPT—complete with cost controls, hallucination checks, and a $0.36-per-page price tag.

1.4k stars · Python · RAG · Search

What it does

scrapeghost was a Python wrapper that outsourced web scraping to OpenAI’s GPT models. You defined a schema in plain Python objects, fed it HTML, and the LLM extracted structured data. The library handled the boring infrastructure: cleaning HTML, splitting oversized pages across multiple API calls, validating JSON, checking for hallucinations, and tracking token spend against a budget.
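The page-splitting step is the easiest piece to picture in code. What follows is a minimal sketch of the general idea, not scrapeghost's own implementation: `split_into_chunks` and `max_chars` are hypothetical names, the fragments are assumed to have been pre-selected already (for example, one table row each), and character count stands in for a real token count.

```python
def split_into_chunks(fragments: list[str], max_chars: int) -> list[str]:
    """Greedily pack HTML fragments into chunks of at most max_chars.

    A single fragment longer than max_chars gets a chunk to itself
    instead of being cut in the middle of its markup.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for frag in fragments:
        # Close the current chunk if adding this fragment would overflow it.
        if current and current_len + len(frag) > max_chars:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(frag)
        current_len += len(frag)
    if current:
        chunks.append("".join(current))
    return chunks


# Hypothetical input: five small table rows, packed two per chunk.
rows = [f"<tr><td>row {i}</td></tr>" for i in range(5)]
chunks = split_into_chunks(rows, max_chars=50)
```

Each chunk would then go out as its own API call and the results would be merged afterwards. A production version would measure length with the model's tokenizer rather than `len()`.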

The interesting bit

The author built guardrails for a fundamentally reckless idea. A “hallucination check” verified GPT’s output against the original page, and you could set a hard cost ceiling, which mattered when a single GPT-4 call on a “moderately sized page” ran $0.36. The auto-splitting feature was genuinely clever: it chunked large HTML across multiple calls rather than truncating and hoping.
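One plausible shape for a hallucination check is a substring test: every string the model returns should appear somewhere in the page it was given. The sketch below assumes that approach. `find_unsupported_values` is a hypothetical helper written for illustration, and the library's actual check may differ.

```python
def find_unsupported_values(data, page_text: str) -> list[str]:
    """Return extracted string values that never appear in page_text."""
    missing: list[str] = []

    def walk(node) -> None:
        # Recurse through nested dicts and lists; only strings are checked.
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, str) and node and node not in page_text:
            missing.append(node)

    walk(data)
    return missing


# Hypothetical page and model output: "party" is not on the page at all.
page = "<h1>Jane Doe</h1><p>District 12</p>"
extracted = {"name": "Jane Doe", "district": "District 12", "party": "Independent"}
unsupported = find_unsupported_values(extracted, page)
```

A check this literal will also flag values the model legitimately normalized, such as a reformatted date or trimmed whitespace, so it works better as a warning signal than as proof of fabrication.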

Key highlights

  • Python-native schema definition—no JSON Schema wrestling
  • Preprocessing pipeline: HTML cleaning, CSS/XPath pre-filtering, auto-splitting for large pages
  • Postprocessing: JSON validation with GPT self-correction, Pydantic schema enforcement, hallucination detection
  • Cost tracking with running token totals and budget hard-stops
  • Automatic model fallback (GPT-3.5-Turbo → GPT-4) for cost/quality tradeoffs
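The last two highlights, budget hard-stops and model fallback, fit together naturally. Here is a rough sketch under stated assumptions: `CostTracker`, `BudgetExceeded`, `extract_with_fallback`, the model names, and the per-token rates are all invented for illustration, and `fake_call` stands in for a real API client.

```python
from typing import Callable, Optional


class BudgetExceeded(RuntimeError):
    """Raised once recorded spend passes the configured ceiling."""


class CostTracker:
    """Keeps running token and dollar totals and enforces a hard stop."""

    def __init__(self, max_cost_usd: float, usd_per_1k_tokens: dict[str, float]):
        self.max_cost_usd = max_cost_usd
        self.usd_per_1k_tokens = usd_per_1k_tokens
        self.total_tokens = 0
        self.total_cost_usd = 0.0

    def record(self, model: str, tokens: int) -> None:
        self.total_tokens += tokens
        self.total_cost_usd += tokens / 1000 * self.usd_per_1k_tokens[model]
        if self.total_cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"spent ${self.total_cost_usd:.4f}, ceiling is ${self.max_cost_usd:.4f}"
            )


def extract_with_fallback(
    html: str,
    models: list[str],
    call_model: Callable[[str, str], tuple[Optional[dict], int]],
    tracker: CostTracker,
) -> dict:
    """Try each model in order; return the first usable result."""
    for model in models:
        data, tokens = call_model(model, html)
        tracker.record(model, tokens)  # failed attempts still cost money
        if data is not None:
            return data
    raise ValueError("no model returned usable data")


def fake_call(model: str, html: str) -> tuple[Optional[dict], int]:
    # Stand-in for an API call: the cheap model "fails", the big one succeeds.
    if model == "cheap-model":
        return None, 1000
    return {"title": "Example"}, 1000


RATES = {"cheap-model": 0.002, "big-model": 0.03}  # illustrative, not real prices
tracker = CostTracker(max_cost_usd=0.05, usd_per_1k_tokens=RATES)
result = extract_with_fallback(
    "<html></html>", ["cheap-model", "big-model"], fake_call, tracker
)
```

Charging the failed cheap attempt before falling back keeps the running total honest. The tradeoff is that the ceiling is enforced only after a call has already been paid for.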

Caveats

  • Explicitly unmaintained: the README states that the author “has no interest in working with commercial LLMs”
  • Expensive at scale: ~$0.36 per moderate page on GPT-4, with pricing estimates explicitly “not guaranteed to be accurate”
  • Source code moved to Codeberg; GitHub repo is archival
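To put the cost caveat in numbers, here is a back-of-the-envelope calculation. It assumes the quoted ~$0.36 figure holds for every page, which the README itself does not guarantee; `estimated_cost_usd` is a hypothetical helper.

```python
CENTS_PER_PAGE = 36  # the ~$0.36 GPT-4 figure quoted above, in whole cents


def estimated_cost_usd(pages: int, cents_per_page: int = CENTS_PER_PAGE) -> float:
    """Scale the per-page estimate linearly; integer cents avoid float drift."""
    return pages * cents_per_page / 100


for pages in (100, 10_000, 1_000_000):
    print(f"{pages:>9,} pages -> ${estimated_cost_usd(pages):,.2f}")
```

At that rate a modest 10,000-page crawl lands at $3,600, which is why the budget ceiling was more than a nicety.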

Verdict

Worth studying if you’re building LLM scraping tools today and want to learn from prior guardrail design. Not worth installing unless you enjoy maintaining abandoned dependencies. The author’s disavowal is the most honest README note you’ll read this week.
