Is text-dedup open source?

Yes — ChenghaoMou/text-dedup is open source, released under the Apache-2.0 license.

What language is text-dedup written in?

ChenghaoMou/text-dedup is primarily written in Python.

How popular is text-dedup?

ChenghaoMou/text-dedup has 764 stars on GitHub.

Where can I find text-dedup?

ChenghaoMou/text-dedup is on GitHub at https://github.com/ChenghaoMou/text-dedup.

ChenghaoMou/text-dedup

Stop writing dedup scripts, start editing TOML

text-dedup lets you compare and run four major text deduplication algorithms by editing a TOML file instead of writing boilerplate Python.

★ 764 stars · Python · Data Tooling

What it does

The toolkit offers four ready-made algorithms for scrubbing repetitive text from large datasets. It handles near-duplicate detection with MinHash and SimHash, plus exact matching via Bloom filters and suffix arrays. You steer the whole process through a single config.toml file rather than writing glue code, and it is explicitly aimed at the large-scale cleaning workflows common in NLP pre-training.
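To make the near-duplicate idea concrete, here is a minimal MinHash sketch in plain Python. It is an illustration of the general technique, not the project's implementation: the function names, the word 3-gram shingling, and the choice of 128 hash functions are all assumptions made for this example.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Split text into a set of overlapping word n-grams (shingles)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(item: str, seed: int) -> int:
    """64-bit hash of item, salted with seed to emulate independent hash functions."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """For each seeded hash function, keep the minimum hash value over all items."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "completely unrelated sentence about stock markets and interest rates today"

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
sig_c = minhash_signature(shingles(doc_c))

near_dup_score = estimated_jaccard(sig_a, sig_b)   # high: one word differs
unrelated_score = estimated_jaccard(sig_a, sig_c)  # low: no shared shingles
```

A pipeline built on this idea would flag pairs whose estimated similarity exceeds a chosen threshold; production systems usually add locality-sensitive hashing so that not every pair has to be compared.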

The interesting bit

The clever angle is treating algorithm selection as a configuration change rather than a code change: you swap MinHash for SimHash by editing a TOML file. The included benchmarks on CORE and NEWS-COPY let you trade off speed and accuracy before picking a strategy, and the defaults are informed by the author’s work on BigScience and BigCode.

Key highlights

Bundles four algorithms: MinHash, SimHash, Bloom Filter, and Suffix Array.
Algorithm choice and parameters live in a single config.toml file.
Ships with reproducible benchmarks on CORE and NEWS-COPY datasets.
On the CORE benchmark, MinHash finished in 11.09 s versus 626.11 s for SimHash, and MinHash still achieved the highest macro F1 score.
Influenced by data-cleaning workflows from BigScience and BigCode.
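For readers unfamiliar with the macro F1 score mentioned in the highlights: it is the unweighted mean of the per-class F1 scores. The sketch below assumes binary duplicate / non-duplicate labels and uses made-up labels; it is not the project's evaluation script, and the function names are chosen for this example.

```python
def f1_for_class(y_true: list, y_pred: list, positive) -> float:
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-class F1, so each class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels: True means "duplicate", False means "not a duplicate".
y_true = [True, True, True, False, False, False]
y_pred = [True, True, False, False, False, True]
score = macro_f1(y_true, y_pred)
```

Because each class is weighted equally, a method cannot score well on macro F1 by doing well only on the more common class.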

Caveats

The Suffix Array mode requires an external Google Research repository at the path specified in its configuration.
SimHash is far slower than MinHash in the bundled benchmarks, so algorithm choice heavily impacts throughput.
The project is framed as a collection of standalone scripts rather than an importable library API.

Verdict

Worth a look if you are prepping large text corpora for training and want to compare deduplication strategies without writing scaffolding. Skip it if you need a library API for inline deduplication inside an existing application pipeline.

Frequently asked

What is ChenghaoMou/text-dedup?

A toolkit for comparing and running four major text deduplication algorithms by editing a TOML file instead of writing boilerplate Python.