shawroad/NLP_pytorch_project

A Chinese NLP cookbook that's splitting into smaller kitchens

This 570-star repo collects PyTorch implementations of classic NLP architectures and is now being split into focused repositories for easier maintenance.

Velocity (7d): +0.2 ★ / day · Trend: steady

What it does This is a broad collection of PyTorch implementations for common NLP tasks: text classification, named entity recognition, machine translation, reading comprehension, text generation, and more. Each folder contains a standalone model or technique, such as BERT variants, GRU+attention seq2seq, TinyBERT distillation, or GPT-2 for Chinese title generation. The author notes that the repo has grown unwieldy and is actively splitting tasks into separate repositories.

The interesting bit The value is in breadth and accessibility: you get working implementations, commented in Chinese, of everything from skip-gram Word2Vec to QANet to FastBERT self-distillation. For reading comprehension specifically, the author flags one baseline as the place to start: it combines sliding-window handling of long texts, answer ranking, and adversarial training in a single file.
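
To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is not taken from the repo: the function names (`sliding_windows`, `to_document_span`) and the `max_len` / `stride` parameters are hypothetical, and the repo's baseline may chunk text differently. The sketch only shows the general technique of cutting a long passage into overlapping windows that fit a model's input limit, then mapping an answer span found in one window back to document positions.

```python
from typing import List, Tuple


def sliding_windows(
    tokens: List[str], max_len: int, stride: int
) -> List[Tuple[int, List[str]]]:
    """Split tokens into overlapping windows of at most max_len tokens.

    Returns (start_offset, window_tokens) pairs. Consecutive windows start
    `stride` tokens apart, so they overlap by max_len - stride tokens.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if stride > max_len:
        raise ValueError("stride must not exceed max_len, or tokens would be skipped")
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return windows


def to_document_span(window_start: int, span: Tuple[int, int]) -> Tuple[int, int]:
    """Map a (start, end) answer span inside a window back to document positions."""
    return (window_start + span[0], window_start + span[1])


# Hypothetical 10-token passage, windows of 4 tokens that advance 2 at a time.
doc = "the quick brown fox jumps over the lazy dog again".split()
windows = sliding_windows(doc, max_len=4, stride=2)
print([start for start, _ in windows])  # [0, 2, 4, 6]
print(to_document_span(4, (1, 2)))      # (5, 6)
```

In a full pipeline, each window would be scored by the reader model and the candidate spans from all windows ranked against each other; that ranking step is omitted here.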

Key highlights

  • ~20 distinct NLP tasks covered, from embedding pre-training to slot filling to text correction
  • Multiple BERT distillation recipes: DynaBERT (pruning), TinyBERT (intermediate-layer MSE), and a 3-layer Transformer student
  • Reading comprehension gets unusually deep coverage: 13 implementations including BiDAF, QANet, Match-LSTM, and multiple pre-trained-model variants
  • Text generation includes a from-scratch GPT-2 implementation plus fine-tuning scripts for summarization and title generation
  • Chinese NLP focus: WoBERT (custom vocab), BERT retraining with MLM, GPT-2 for Chinese text generation
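
As an illustration of the "intermediate-layer MSE" mentioned for TinyBERT above, the following is a minimal sketch of a TinyBERT-style hidden-state loss. It is deliberately written in plain Python so it runs without dependencies; the repo itself uses PyTorch, and none of the names here (`uniform_layer_map`, `mse`, `intermediate_layer_loss`) come from it. Two simplifying assumptions are made: student and teacher share the same hidden size (so the learned projection used when sizes differ is left out), and the teacher depth is a multiple of the student depth so a uniform layer mapping applies. The embedding, attention, and prediction-layer terms of full TinyBERT distillation are not covered.

```python
from typing import List, Sequence

Vector = Sequence[float]
Layer = Sequence[Vector]  # one hidden vector per token


def uniform_layer_map(num_student: int, num_teacher: int) -> List[int]:
    """Map student layer m (1-based) to teacher layer m * (num_teacher // num_student)."""
    if num_teacher % num_student != 0:
        raise ValueError("this sketch assumes teacher depth is a multiple of student depth")
    step = num_teacher // num_student
    return [m * step for m in range(1, num_student + 1)]


def mse(a: Layer, b: Layer) -> float:
    """Mean squared error over every element of two equally shaped layers."""
    total, count = 0.0, 0
    for vec_a, vec_b in zip(a, b):
        for x, y in zip(vec_a, vec_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def intermediate_layer_loss(student: Sequence[Layer], teacher: Sequence[Layer]) -> float:
    """Sum the MSE between each student layer and its mapped teacher layer."""
    mapping = uniform_layer_map(len(student), len(teacher))
    return sum(mse(s, teacher[t - 1]) for s, t in zip(student, mapping))


# Toy hidden states: 2 tokens, hidden size 2. Teacher layer k is filled with k.
teacher = [[[float(k)] * 2 for _ in range(2)] for k in range(1, 5)]
student = [
    [[2.0, 2.0], [2.0, 2.0]],  # compared with teacher layer 2: error 0
    [[3.0, 3.0], [3.0, 3.0]],  # compared with teacher layer 4: error 1
]
loss = intermediate_layer_loss(student, teacher)
print(uniform_layer_map(4, 12))  # [3, 6, 9, 12]
print(loss)                      # 1.0
```

In a PyTorch training loop the same quantity would normally be computed on tensors so that gradients flow into the student; the list-based version above is only meant to show which layers are compared and how the errors are summed.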

Caveats

  • The README is a flat directory listing with minimal usage instructions; you’ll need to dig into individual folders
  • The author explicitly states maintenance is becoming difficult and recommends migrating to newer split-out repos for text classification, semantic similarity, and text generation
  • No benchmarks, training data, or pre-trained weights are mentioned

Verdict Good for researchers or students who want readable, runnable PyTorch implementations of standard NLP architectures with Chinese-language comments. Skip it if you need a maintained, documented library that installs via pip and ships pre-trained models: this is reference code, not a framework.
