
ChenghaoMou/text-dedup

A text deduplication toolkit implementing MinHash, SimHash, SuffixArray, and Bloom Filter algorithms for removing near-duplicate and exact-duplicate text from datasets.

760 stars Python Data Tooling
Velocity (7d): +0.4 ★/day · Trend: steady

This repository provides ready-to-use text deduplication scripts for preparing ML training datasets. It supports near-duplicate detection using MinHash with MinHashLSH and SimHash in 64-bit or 128-bit variants, along with exact deduplication via Bloom filters and SuffixArray substring matching. All algorithms are configured through TOML files, so they can be customized within data preprocessing pipelines.
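To illustrate the idea behind MinHash-based near-duplicate detection mentioned above, here is a minimal standard-library sketch. It is not the repository's implementation and does not use its API: the function names (`shingles`, `minhash_signature`, `estimate_jaccard`), the word-trigram tokenization, and the seeded BLAKE2b hashing are all assumptions chosen for illustration, and the LSH bucketing step is omitted.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams in text (hypothetical tokenization)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(seed, token):
    """64-bit hash of a token; each seed acts as one independent hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=128, n=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, n)
    return [min(_seeded_hash(seed, g) for g in grams) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the sleepy dog near the river bank"
doc_c = "completely unrelated sentence about training large language models today"

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
sig_c = minhash_signature(doc_c)

near = estimate_jaccard(sig_a, sig_b)  # documents differing by one word
far = estimate_jaccard(sig_a, sig_c)   # documents sharing no trigrams
print(f"near-duplicate estimate: {near:.2f}, unrelated estimate: {far:.2f}")
```

In a deduplication pass, pairs whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates. Comparing every pair does not scale to large datasets, which is why MinHash is typically paired with an LSH index that only compares documents landing in the same bucket.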
