
cxcscmu/Craw4LLM

Craw4LLM is a web crawler optimized for gathering high-quality text data to pretrain large language models.

653 stars · Python · Data Tooling · Language Models
Velocity (7d): +1.4 ★/day · Trend: steady

The project implements a pipeline for efficiently crawling and filtering documents from the ClueWeb22 dataset for LLM training. It uses a multi-stage selection process that combines length filtering with quality scoring by the fastText-based DCLM classifier. Documents are ranked and selected iteratively to build a pretraining corpus.
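The selection process described above can be illustrated with a minimal sketch. It is not the repository's code: `iterative_select`, `toy_score`, the toy documents, and all parameter names are hypothetical, and `toy_score` is only a stand-in for the DCLM fastText classifier so that the example runs without a model file. The sketch assumes documents that fail the length filter are dropped and their links are not followed.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple


def iterative_select(
    texts: Dict[str, str],
    outlinks: Dict[str, List[str]],
    seeds: List[str],
    score_fn: Callable[[str], float],
    min_words: int = 5,
    per_round: int = 2,
    max_docs: int = 4,
) -> List[str]:
    """Select documents round by round, highest quality score first."""
    seen: Set[str] = set()
    frontier: List[Tuple[float, str]] = []  # min-heap of (-score, doc_id)

    def push(doc_id: str) -> None:
        if doc_id in seen or doc_id not in texts:
            return
        seen.add(doc_id)
        text = texts[doc_id]
        # Stage 1: length filter (assumed here to be a simple word count).
        if len(text.split()) < min_words:
            return
        # Stage 2: quality score; negated so the best document pops first.
        heapq.heappush(frontier, (-score_fn(text), doc_id))

    for seed in seeds:
        push(seed)

    selected: List[str] = []
    while frontier and len(selected) < max_docs:
        n = min(per_round, len(frontier), max_docs - len(selected))
        # Take the top-n documents of this round, then expand their links.
        batch = [heapq.heappop(frontier)[1] for _ in range(n)]
        selected.extend(batch)
        for doc_id in batch:
            for linked in outlinks.get(doc_id, []):
                push(linked)
    return selected


# Hypothetical stand-in for a learned quality classifier:
# the fraction of words that belong to a small "useful" vocabulary.
GOOD_WORDS = {"language", "models", "text", "corpora", "pretraining",
              "data", "quality", "model", "crawler", "corpus"}


def toy_score(text: str) -> float:
    words = text.lower().split()
    return sum(w in GOOD_WORDS for w in words) / len(words)


texts = {
    "a": "language models learn from large high quality text corpora",
    "b": "buy now",
    "c": "click here to win a free prize today now",
    "d": "pretraining data quality strongly affects downstream model accuracy",
    "e": "a crawler scores each page before adding it to the corpus",
}
outlinks = {"a": ["b", "c", "d"], "d": ["e"]}

result = iterative_select(texts, outlinks, ["a"], toy_score,
                          per_round=1, max_docs=3)
print(result)
```

In this toy run the two-word page `b` is removed by the length filter and never enters the frontier, while the remaining pages are picked in descending score order as they are discovered. Swapping `toy_score` for a real classifier changes only the `score_fn` argument.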
