ChenghaoMou/text-dedup
A text deduplication toolkit implementing MinHash, SimHash, SuffixArray, and Bloom Filter algorithms for removing near-duplicate and exact-duplicate text from datasets.

Velocity (7d): +0.4 ★/day · Trend: steady
This repository provides ready-to-use text deduplication scripts for preparing ML training datasets. It supports near-duplicate detection with MinHash (using MinHashLSH) and SimHash (64- or 128-bit variants), and exact deduplication with Bloom filters and SuffixArray substring matching. All algorithms are configured through TOML files, so they can be customized within data preprocessing pipelines.
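To make the near-duplicate idea concrete, here is a minimal standard-library sketch of MinHash-based deduplication. It is not the repository's implementation or API: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `dedup`), the word-trigram shingling, the 128 hash functions, and the 0.5 threshold are all assumptions chosen for illustration, and a brute-force pairwise comparison stands in for the LSH index that a real pipeline would use.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of lowercase word n-grams in `text` (hypothetical helper)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items, num_perm=128):
    """Build a MinHash signature: for each seeded hash, keep the minimum value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs, threshold=0.5, num_perm=128):
    """Return indices of documents kept after dropping near-duplicates.

    Brute-force O(n^2) comparison for clarity; an LSH index would avoid
    comparing every pair on large datasets.
    """
    kept = []  # (index, signature) of documents retained so far
    for i, doc in enumerate(docs):
        sig = minhash_signature(shingles(doc), num_perm)
        if all(estimated_jaccard(sig, other) < threshold for _, other in kept):
            kept.append((i, sig))
    return [i for i, _ in kept]
```

A possible usage, under the same assumptions: `dedup(["The quick brown fox jumps over the lazy dog", "the quick brown fox jumps over the lazy dog!", "An unrelated sentence about data pipelines"])` keeps the first and third documents, because the second normalizes to the same shingle set as the first. The threshold trades recall for precision: lowering it removes more loosely similar documents at the risk of dropping distinct ones.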