← all repositories
zhanlaoban/EDA_NLP_for_Chinese

Four cheap tricks to stretch your Chinese NLP dataset

A port of the EDA paper that swaps, deletes, and jumbles Chinese text to fake more training data.

1.4k stars Python Data ToolingLanguage Models
EDA_NLP_for_Chinese
Velocity · 7d
+0.5
★ / day
Trend
steady
star history

What it does Takes a tab-separated Chinese text file and spits out augmented copies using four simple mutations: synonym replacement, random insertion, random swap, and random deletion. It is a direct adaptation of the English EDA toolkit, wired for Chinese tokenization via jieba and synonym lookup via the Synonyms library.

The interesting bit The README doubles as a surprisingly thorough paper reading note, complete with LaTeX formulas and experimental tables. The author did not just translate the code; they translated the paper’s reasoning into Chinese, including the warning that too much augmentation can corrupt labels.

Key highlights

  • Four augmentation strategies: synonym replacement, random insertion, random swap, random deletion
  • Sentence-length-aware mutation counts via an alpha scaling factor
  • Bundles four common Chinese stopword lists (cn, HIT, Baidu, SCU)
  • CLI interface: python code/augment.py --input=train.txt --output=train_augmented.txt --num_aug=16 --alpha=0.05
  • Paper notes claim EDA can match full-dataset accuracy with only 50% of the data on small classification tasks

Caveats

  • The code itself is not shown in the README; you are trusting that augment.py exists and matches the described behavior
  • Synonym quality depends on the bundled Synonyms library, which is not discussed in detail
  • The paper’s reported gains (0.8% on full data, 3% on 500-sample sets) are from English benchmarks, not Chinese ones

Verdict Worth a look if you are building Chinese text classifiers with thin training data and want a quick augmentation baseline before trying heavier artillery like back-translation or LLM-based generation. Skip it if you need guarantees that label semantics survive the noise.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.