Is gpt-crawler open source?

Yes — BuilderIO/gpt-crawler is open source, released under the ISC license.

What language is gpt-crawler written in?

BuilderIO/gpt-crawler is primarily written in TypeScript.

How popular is gpt-crawler?

BuilderIO/gpt-crawler has 22.4k stars on GitHub.

Where can I find gpt-crawler?

BuilderIO/gpt-crawler is on GitHub at https://github.com/BuilderIO/gpt-crawler.

← all repositories

BuilderIO/gpt-crawler

Turn any docs site into a custom GPT with one config file

A Playwright crawler that scrapes pages, extracts text via CSS selectors, and emits a JSON file ready for OpenAI's knowledge upload.

★22.4k stars TypeScript Data Tooling Coding Assistants

View on GitHub ↗ Homepage ↗

Not currently ranked — collecting fresh signals.

star history

What it does GPT Crawler is a TypeScript CLI that walks a website using Playwright, pulls text from specified DOM selectors, and dumps the results into a single JSON file. That file goes straight into OpenAI to back a custom GPT or Assistants API bot. Docker and an Express API server are included if you prefer containers or HTTP triggers over npm start.

The interesting bit The project treats “knowledge” as a build artifact: you commit a config.ts, run the crawler, and produce a versioned data file. The resourceExclusions array is comically exhaustive—someone clearly copy-pasted every file extension they could find—but it means you won’t accidentally ship a 50 MB PDF to OpenAI.

Key highlights

CSS-selector-based extraction, so you can target just article bodies and skip nav/footers
Hard ceiling on pages (maxPagesToCrawl), file size (maxFileSize), and tokens (maxTokens) to stay inside OpenAI’s limits
Sitemap-aware: hand it a sitemap URL and it skips link discovery entirely
Ships with Docker setup and a Swagger-documented Express API at /crawl
Output is a single JSON blob; OpenAI’s upload flow is documented with GIFs

Caveats

Requires Node.js ≥ 16; the README doesn’t mention if newer versions are tested
Token counting is mentioned but not explained—unclear if it uses tiktoken or a rough heuristic
The API server defaults to port 3000 with no auth discussed; probably not for public exposure

Verdict Worth a look if you maintain docs and want a repeatable pipeline to a custom GPT. Skip it if you need real-time sync or multi-source RAG beyond OpenAI’s file limits.

Frequently asked

What is BuilderIO/gpt-crawler?: A Playwright crawler that scrapes pages, extracts text via CSS selectors, and emits a JSON file ready for OpenAI's knowledge upload.
Is gpt-crawler open source?: Yes — BuilderIO/gpt-crawler is open source, released under the ISC license.
What language is gpt-crawler written in?: BuilderIO/gpt-crawler is primarily written in TypeScript.
How popular is gpt-crawler?: BuilderIO/gpt-crawler has 22.4k stars on GitHub.
Where can I find gpt-crawler?: BuilderIO/gpt-crawler is on GitHub at https://github.com/BuilderIO/gpt-crawler.