cxcscmu/Craw4LLM
Craw4LLM is a web crawler optimized for gathering high-quality text data to pretrain large language models.

Velocity (7d): +1.4 ★/day · Trend: steady
The project implements a pipeline for efficiently crawling and filtering documents from the ClueWeb22 dataset. Its multi-stage selection process combines a length filter with quality scoring from the fastText-based DCLM classifier. Documents are ranked by score and selected iteratively to build a corpus for language model pretraining.
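The selection process described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the names `select_documents` and `toy_score`, the parameters `min_words` and `max_docs`, and the order of the stages are all assumptions made for the example, and `toy_score` is a stand-in for the DCLM fastText classifier so that the snippet runs without a model file.

```python
import heapq
from typing import Callable, Dict, List


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a quality classifier: ratio of unique words."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


def select_documents(
    docs: Dict[str, str],
    score_fn: Callable[[str], float],
    min_words: int = 3,
    max_docs: int = 2,
) -> List[str]:
    """Return document ids in the order they are selected."""
    # Stage 1 (assumed order): drop documents that are too short.
    candidates = {
        doc_id: text
        for doc_id, text in docs.items()
        if len(text.split()) >= min_words
    }

    # Stage 2: score each remaining document. Scores are negated because
    # heapq is a min-heap and we want the highest score first.
    heap = [(-score_fn(text), doc_id) for doc_id, text in candidates.items()]
    heapq.heapify(heap)

    # Stage 3: iteratively take the best remaining document until the
    # budget is reached or no candidates are left.
    selected: List[str] = []
    while heap and len(selected) < max_docs:
        _, doc_id = heapq.heappop(heap)
        selected.append(doc_id)
    return selected


docs = {
    "a": "the the the the",
    "b": "web crawling for language model pretraining data",
    "c": "short",
    "d": "quality text about crawling crawling",
}
top = select_documents(docs, toy_score, min_words=3, max_docs=2)
```

In a real setup, `score_fn` would wrap the classifier's prediction for each document; the priority queue keeps each selection step at logarithmic cost as the candidate pool grows.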