425776024/nlpcda
A Chinese NLP data augmentation toolkit that uses BERT and SimBERT to generate synthetic training samples.

This repository provides a one-click Chinese data augmentation tool for NLP tasks. It supports several augmentation strategies, including synonym replacement, random character deletion, augmentation of BIO-tagged NER data, and similar-sentence generation with SimBERT. The toolkit also explores audio-based text augmentation, which passes text through a text-to-speech and speech recognition pipeline using FastSpeech2 and Wav2Vec2. It is designed to improve the generalization of NLP models and their robustness to adversarial inputs by generating training data that stays semantically consistent with the original samples.
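
To make two of the strategies above concrete, here is a minimal, self-contained sketch of synonym replacement and random character deletion. It does not use the toolkit's own API: the names `SYNONYMS`, `replace_synonyms`, `delete_random_chars`, and `augment`, the toy synonym table, and the default rates are all assumptions made for illustration, and the input is assumed to be already segmented into words.

```python
import random

# Hypothetical toy synonym table (an assumption for this sketch);
# a real toolkit would load a much larger lexicon.
SYNONYMS = {
    "好": ["棒", "佳"],
    "快速": ["迅速", "飞快"],
}


def replace_synonyms(tokens, rate, rng):
    """Swap each token that has a known synonym with probability `rate`."""
    out = []
    for tok in tokens:
        candidates = SYNONYMS.get(tok)
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out


def delete_random_chars(text, rate, rng):
    """Drop each character independently with probability `rate`.

    If every character would be dropped, keep the first one so the
    sample is never empty (unless the input itself is empty).
    """
    kept = [ch for ch in text if rng.random() >= rate]
    return "".join(kept) if kept else text[:1]


def augment(tokens, n, rate=0.3, seed=0):
    """Return up to `n` distinct variants that differ from the original text."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    original = "".join(tokens)
    variants = set()
    for _ in range(n * 10):  # bounded number of attempts
        if len(variants) >= n:
            break
        candidate = "".join(replace_synonyms(tokens, rate, rng))
        if rng.random() < 0.5:
            # Apply deletion to roughly half of the candidates, at a lower rate.
            candidate = delete_random_chars(candidate, rate / 3, rng)
        if candidate != original:
            variants.add(candidate)
    return sorted(variants)


if __name__ == "__main__":
    for variant in augment(["今天", "天气", "好"], n=3):
        print(variant)
```

Character deletion can change the meaning of a short sentence, so the deletion rate in this sketch is kept well below the replacement rate; in practice both rates would need tuning against the downstream task.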