chiphuyen/lazynlp
A Python library for crawling, cleaning, and deduplicating webpages to build massive monolingual datasets.

Velocity (7d): +0.9 ★/day · Trend: steady
lazynlp provides tools to scrape web pages, clean their content, and deduplicate the resulting text, drawing on sources such as Reddit and Project Gutenberg. Its stated aim is to let users build datasets larger than the data used for GPT-2. The workflow covers three stages: URL collection, content extraction, and deduplication. The output is intended as a large-scale text corpus for language model training.
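
To make the three stages concrete, here is a minimal standard-library sketch of a collect, clean, and deduplicate pass. It is not lazynlp's API: the function names `dedup_urls`, `clean_page`, and `dedup_texts` are hypothetical, the regex-based tag stripping is a crude stand-in for real content extraction, and exact-match hashing is an assumed simplification of whatever deduplication strategy the library actually uses.

```python
import hashlib
import html
import re


def dedup_urls(urls):
    """Drop repeated URLs, keeping first-seen order (hypothetical helper)."""
    seen = set()
    unique = []
    for url in urls:
        # Normalize lightly: trim whitespace and a trailing slash.
        key = url.strip().rstrip("/")
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def clean_page(raw_html):
    """Reduce an HTML string to plain text (assumed, simplified cleaning)."""
    # Remove script and style blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", raw_html)
    # Remove the remaining tags, then decode entities such as &amp;.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    text = html.unescape(text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def dedup_texts(texts):
    """Keep the first copy of each text, comparing case-insensitively."""
    seen = set()
    kept = []
    for text in texts:
        digest = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept
```

In a real pipeline the cleaned pages would come from downloaded files rather than in-memory strings, and near-duplicate detection (for example, n-gram overlap) would likely replace the exact-match hash used here.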