Is nlpcda open source?

Yes — 425776024/nlpcda is open source, released under the Apache-2.0 license.

What language is nlpcda written in?

425776024/nlpcda is primarily written in Python.

How popular is nlpcda?

425776024/nlpcda has 1.9k stars on GitHub.

Where can I find nlpcda?

425776024/nlpcda is on GitHub at https://github.com/425776024/nlpcda.

425776024/nlpcda

Nine ways to warp Chinese text without breaking your NER tags

Generates synthetic Chinese training text by perturbing homophones, entities, and word order while shielding dates, numbers, and NER labels from corruption.

★ 1.9k stars · Python · Data Tooling · ML Frameworks

What it does

nlpcda is a Python toolkit that synthesizes Chinese training text through nine augmentation strategies: entity replacement, synonym and homophone swapping, character deletion, adjacent-character transposition, equivalent-character substitution (e.g., 1 → 壹 → ①), NER-aware BIO augmentation, back-translation, and SimBERT paraphrasing. Each strategy is exposed as a standalone class with configurable change rates and built-in lexicons, and the library explicitly shields dates, times, and numeric fragments from corruption.
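To make the "perturb the text but shield numeric fragments" idea concrete, here is a minimal standalone sketch of adjacent-character transposition. It does not use or reproduce nlpcda's code: the function name `swap_adjacent`, the `PROTECTED` pattern, and the `seed` parameter are assumptions made for illustration, and only `change_rate` echoes the configurable rate described above.

```python
import random
import re

# Hypothetical pattern for fragments to protect: a run of digits, optionally
# followed by a date/time unit character. This is an assumption for the
# sketch, not the rule nlpcda itself applies.
PROTECTED = re.compile(r"\d+[年月日时分秒点]?")


def swap_adjacent(text: str, change_rate: float = 0.3, seed: int = 0) -> str:
    """Swap neighbouring characters with probability change_rate,
    never touching characters inside a protected fragment."""
    rng = random.Random(seed)

    # Collect every character index that falls inside a protected fragment.
    locked = set()
    for match in PROTECTED.finditer(text):
        locked.update(range(match.start(), match.end()))

    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_free = i not in locked and (i + 1) not in locked
        if both_free and rng.random() < change_rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip ahead so the same character is not moved twice
        else:
            i += 1
    return "".join(chars)


if __name__ == "__main__":
    sample = "这个会议将在2021年3月5日下午举行"
    print(swap_adjacent(sample, change_rate=0.5, seed=42))
```

Because a swap only happens when both neighbours are unprotected, dates and numbers keep their exact positions while the surrounding characters are shuffled locally.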

The interesting bit

The author treats Chinese text as a first-class citizen rather than an afterthought: the library exploits the observation that Chinese often stays readable when neighboring characters swap places, so it locally shuffles adjacent characters, and it maintains a dedicated Ner class that respects BIO boundaries. There is also work-in-progress support for a speech round-trip pipeline, which synthesizes audio with FastSpeech2 and transcribes it back with Wav2Vec2 to introduce realistic ASR noise.

Key highlights

NER-native: the Ner class reads BIO-tagged files and augments only selected entity types while keeping labels aligned.
Linguistically defensive: preserves dates, times, and numbers during deletion or transposition; supports homophone confusion (的/地/得) and formal-equivalent characters.
Generative option: integrates SimBERT for neural paraphrasing alongside classical rule-based perturbations.
Honest marketing: the author openly warns that these techniques do not help on pure accuracy competitions, framing the value around robustness instead.
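The "keeping labels aligned" point in the NER highlight can be sketched as follows. This is an illustrative standalone function, not nlpcda's Ner class: the name `replace_entities`, the `(character, tag)` pair representation, and the `lexicon` argument are all assumptions. It replaces each entity of a selected type with an entry from a replacement list and rebuilds the B-/I- tags so that the new entity, which may differ in length, is still tagged consistently.

```python
import random
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (character, BIO tag), e.g. ("北", "B-LOC")


def replace_entities(
    tokens: List[Token],
    lexicon: Dict[str, List[str]],
    rng: random.Random,
) -> List[Token]:
    """Replace every entity whose type appears in lexicon with a random
    lexicon entry, re-tagging it as one B- tag followed by I- tags.
    Lexicon entries are assumed to be non-empty strings."""
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        char, tag = tokens[i]
        if tag.startswith("B-") and tag[2:] in lexicon:
            ent_type = tag[2:]
            # Advance j past the I- tags that continue this entity.
            j = i + 1
            while j < len(tokens) and tokens[j][1] == f"I-{ent_type}":
                j += 1
            new_text = rng.choice(lexicon[ent_type])
            out.append((new_text[0], f"B-{ent_type}"))
            out.extend((c, f"I-{ent_type}") for c in new_text[1:])
            i = j
        else:
            # Non-selected entities and O tokens pass through unchanged.
            out.append((char, tag))
            i += 1
    return out


if __name__ == "__main__":
    sentence = [("我", "O"), ("在", "O"), ("北", "B-LOC"), ("京", "I-LOC"),
                ("见", "O"), ("王", "B-PER"), ("明", "I-PER")]
    augmented = replace_entities(sentence, {"LOC": ["哈尔滨", "上海"]}, random.Random(0))
    for char, tag in augmented:
        print(char, tag)
```

Only the entity types present in `lexicon` are touched, which mirrors the idea of augmenting selected entity types while leaving everything else, including other entities, as it was.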

Caveats

The README explicitly warns that this toolkit “一般不会有分数提升” ("generally will not improve your score") on pure-accuracy leaderboards, so expect robustness and generalization gains rather than score jumps.
Aggressive settings can quickly degrade readability, as demonstrated by the README’s own example of progressive garbling (“能被猜出来”, "can be guessed", → “能被菜粗来”, a near-homophone scramble of the same phrase); quality control is left to the user.
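Since quality control is left to the user, one simple option is to discard candidates that drift too far from the source sentence. The sketch below is an assumption about how such a gate could look, not something nlpcda provides: `keep_readable` and `min_ratio` are hypothetical names, and the similarity measure is Python's standard `difflib.SequenceMatcher` ratio, which is a crude character-overlap proxy rather than a real readability score.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def keep_readable(
    original: str,
    candidates: Iterable[str],
    min_ratio: float = 0.7,
) -> List[str]:
    """Keep augmented candidates that differ from the original but still
    share at least min_ratio character-level similarity with it."""
    kept: List[str] = []
    for cand in candidates:
        if cand == original:
            continue  # an unchanged copy adds no new training signal
        ratio = SequenceMatcher(None, original, cand).ratio()
        if ratio >= min_ratio:
            kept.append(cand)
    return kept


if __name__ == "__main__":
    source = "能被猜出来"
    # Illustrative candidates with increasing amounts of corruption.
    variants = ["能被猜出来", "能被猜粗来", "能被菜粗来", "呢杯菜粗赖"]
    print(keep_readable(source, variants))
```

The threshold is a tuning knob: a higher `min_ratio` keeps only light perturbations, while a lower one admits noisier samples that may help robustness but are harder to read.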

Verdict

Useful if you need Chinese training variations with entity and temporal awareness. Irrelevant if you are chasing benchmark accuracy medals or working outside Chinese NLP.
