Is Craw4LLM open source?

Yes — cxcscmu/Craw4LLM is open source, released under the MIT license.

What language is Craw4LLM written in?

cxcscmu/Craw4LLM is primarily written in Python.

How popular is Craw4LLM?

cxcscmu/Craw4LLM has 660 stars on GitHub.

Where can I find Craw4LLM?

cxcscmu/Craw4LLM is on GitHub at https://github.com/cxcscmu/Craw4LLM.

cxcscmu/Craw4LLM

A pickier web crawler for LLM training data

Most crawlers collect the web indiscriminately; this one ranks ClueWeb22 pages by quality scores to build a leaner LLM pretraining corpus.

★ 660 stars · Python · Data Tooling · Language Models

What it does

Craw4LLM is a simulated crawler that treats the ClueWeb22 archive as a closed web graph, selectively harvesting documents for LLM pretraining. Instead of grabbing every page in reach, it iteratively scores candidates using configurable raters (such as a DCLM fastText classifier or a length heuristic) and keeps only the top-ranked material. The result is a curated list of document IDs ready to be fetched and fed into the DCLM training framework.
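To make that loop concrete, here is a minimal Python sketch of a score-prioritized crawl over a closed link graph. It is an illustration under stated assumptions, not Craw4LLM's actual code: the function name `simulated_crawl`, its parameters, the toy graph, and the hand-assigned quality scores are all hypothetical. A real run would score pages with a rater such as the DCLM fastText classifier rather than a lookup table.

```python
import heapq


def simulated_crawl(graph, seeds, score, docs_per_iter, max_docs):
    """Crawl a closed link graph, always expanding the best-scored pages first.

    graph: dict mapping a doc ID to the list of doc IDs it links to
    score: callable mapping a doc ID to a float (higher means better)
    """
    visited = set(seeds)
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-score(doc), doc) for doc in seeds]
    heapq.heapify(frontier)
    selected = []

    while frontier and len(selected) < max_docs:
        # Keep only the top-ranked candidates in this iteration.
        take = min(docs_per_iter, max_docs - len(selected), len(frontier))
        batch = [heapq.heappop(frontier)[1] for _ in range(take)]
        selected.extend(batch)

        # "Fetch" each kept page by reading its outlinks from the static graph.
        for doc in batch:
            for link in graph.get(doc, []):
                if link not in visited:
                    visited.add(link)
                    heapq.heappush(frontier, (-score(link), link))
    return selected


# Toy closed graph and made-up quality scores (assumptions for illustration).
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
toy_quality = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.95, "e": 0.1}

result = simulated_crawl(toy_graph, ["a"], toy_quality.get, docs_per_iter=1, max_docs=3)
print(result)  # ['a', 'c', 'b']
```

Note that page `d` has the highest score but is never reached within the budget, because it is only discoverable through the low-scoring page `b`. That is the kind of trade-off a simulated crawl lets you study offline.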

The interesting bit

This turns web crawling from a distributed systems headache into a local graph-ranking exercise. By replaying links inside the static ClueWeb22 snapshot, you can test whether quality-aware selection beats random or popularity-based crawling without touching a live server. Ranking strategies are mixed and matched through a YAML config.

Key highlights

Targets LLM pretraining data quality, not bulk collection.
Runs fully offline against the ClueWeb22 snapshot.
Pluggable scoring via YAML: combine fasttext_score, length, and inlink_count raters.
Includes random and indegree baselines for head-to-head comparison.
Feeds directly into the DCLM pretraining framework.
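The pluggable-scoring idea in the highlights above can be sketched as a small rater registry driven by a config dictionary. Everything here is a hypothetical stand-in: the `RATERS` registry, the `rank` function, the config keys, and the `toy_quality` heuristic are assumptions made for illustration, and none of it reproduces the project's real schema or its fastText scorer.

```python
import random


def length_rater(doc):
    # Score a document by the character length of its text.
    return float(len(doc["text"]))


def inlink_count_rater(doc):
    # Popularity baseline: score by the number of incoming links.
    return float(doc["inlinks"])


def toy_quality_rater(doc):
    # Crude stand-in for a learned quality classifier (NOT a fastText model):
    # the fraction of words that are explanatory connectives.
    words = doc["text"].split()
    return sum(w in {"because", "therefore"} for w in words) / max(len(words), 1)


def make_random_rater(seed):
    # Random baseline with a fixed seed so runs are repeatable.
    rng = random.Random(seed)
    return lambda doc: rng.random()


# Hypothetical registry mapping rater names to scoring functions.
RATERS = {
    "length": length_rater,
    "inlink_count": inlink_count_rater,
    "toy_quality": toy_quality_rater,
    "random": make_random_rater(0),
}


def rank(docs, config):
    """Score docs with every configured rater, then sort by the selected one."""
    scored = []
    for doc in docs:
        scores = {name: RATERS[name](doc) for name in config["rating_methods"]}
        scored.append((scores[config["selection_method"]], doc["id"]))
    descending = config.get("order", "desc") == "desc"
    return [doc_id for _, doc_id in sorted(scored, reverse=descending)]


docs = [
    {"id": "d1", "text": "short page", "inlinks": 50},
    {"id": "d2", "text": "this holds because the proof works therefore done", "inlinks": 2},
    {"id": "d3", "text": "a long page of " + "filler " * 6 + "filler", "inlinks": 10},
]

quality_config = {"rating_methods": ["length", "toy_quality"], "selection_method": "toy_quality"}
indegree_config = {"rating_methods": ["inlink_count"], "selection_method": "inlink_count"}

print(rank(docs, quality_config)[0])  # d2
print(rank(docs, indegree_config))    # ['d1', 'd3', 'd2']
```

Swapping the `selection_method` entry is all it takes to switch between the quality-aware ranking and a baseline, which is what makes the head-to-head comparison cheap to run.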

Caveats

Requires requesting the ClueWeb22 dataset separately and storing it on an SSD.
The README is tightly coupled to its research paper, so expect a research artifact rather than a general-purpose tool.

Verdict

Worth exploring if you are experimenting with LLM data curation and already have ClueWeb22 access. If you need a live web spider or a turn-key training stack, look elsewhere.
