Turn any docs site into a custom GPT with one config file
A Playwright crawler that scrapes pages, extracts text via CSS selectors, and emits a JSON file ready for OpenAI's knowledge upload.

What it does
GPT Crawler is a TypeScript CLI that walks a website using Playwright, pulls text from specified DOM selectors, and dumps the results into a single JSON file. That file goes straight into OpenAI to back a custom GPT or Assistants API bot. Docker and an Express API server are included if you prefer containers or HTTP triggers over npm start.
The interesting bit
The project treats “knowledge” as a build artifact: you commit a config.ts, run the crawler, and produce a versioned data file. The resourceExclusions array is comically exhaustive—someone clearly copy-pasted every file extension they could find—but it means you won’t accidentally ship a 50 MB PDF to OpenAI.
Key highlights
- CSS-selector-based extraction, so you can target just article bodies and skip nav/footers
- Hard ceiling on pages (
maxPagesToCrawl), file size (maxFileSize), and tokens (maxTokens) to stay inside OpenAI’s limits - Sitemap-aware: hand it a sitemap URL and it skips link discovery entirely
- Ships with Docker setup and a Swagger-documented Express API at
/crawl - Output is a single JSON blob; OpenAI’s upload flow is documented with GIFs
Caveats
- Requires Node.js ≥ 16; the README doesn’t mention if newer versions are tested
- Token counting is mentioned but not explained—unclear if it uses tiktoken or a rough heuristic
- The API server defaults to port 3000 with no auth discussed; probably not for public exposure
Verdict Worth a look if you maintain docs and want a repeatable pipeline to a custom GPT. Skip it if you need real-time sync or multi-source RAG beyond OpenAI’s file limits.