
chiphuyen/lazynlp

A Python library for crawling, cleaning, and deduplicating webpages to build massive monolingual datasets.

2.3k stars · Python · Data Tooling
Velocity (7d): +0.9 ★/day · Trend: steady

Lazynlp provides tools to scrape web pages, clean content, and deduplicate text from sources like Reddit and Project Gutenberg. It aims to enable users to create datasets larger than those used for GPT-2, handling URL collection, content extraction, and deduplication workflows. The library is designed for building large-scale text corpora used in language model training.
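The clean-then-deduplicate workflow described above can be illustrated with a minimal standard-library sketch. This is not lazynlp's API: the function names `clean_text`, `ngrams`, and `dedupe`, the trigram size, and the 0.5 overlap threshold are all hypothetical choices made for illustration, and a real pipeline at GPT-2 scale would likely need a more memory-efficient structure than an in-memory set.

```python
import html
import re


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)      # drop anything that looks like a tag
    unescaped = html.unescape(no_tags)          # e.g. "&amp;" becomes "&"
    return re.sub(r"\s+", " ", unescaped).strip()


def ngrams(text: str, n: int = 3) -> set:
    """Return the set of lowercase word n-grams in text.

    Texts shorter than n words are treated as a single n-gram so that
    short exact duplicates are still detected.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def dedupe(docs, n: int = 3, threshold: float = 0.5):
    """Keep documents whose n-gram overlap with earlier kept documents
    is below the threshold (assumed value, tune for your corpus)."""
    seen = set()   # n-grams from every document kept so far
    kept = []
    for doc in docs:
        grams = ngrams(doc, n)
        if not grams:
            continue  # skip empty documents
        overlap = len(grams & seen) / len(grams)
        if overlap >= threshold:
            continue  # mostly already seen, treat as a duplicate
        seen |= grams
        kept.append(doc)
    return kept
```

As a usage example, passing each scraped page through `clean_text` and then calling `dedupe` on the resulting list drops a page whose text repeats an earlier one even when the surrounding markup differs, while keeping pages with unrelated text.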
