chenyuntc/PyTorchText

How to win a Chinese NLP competition: throw every model at it

A 2017 competition-winning repo that ensembles CNNs, LSTMs, RCNNs, and even FastText to classify Zhihu questions.

1.1k stars · Python · ML Frameworks · Language Models
Velocity (7d): +0.3 ★/day · Trend: steady

What it does

This is the first-place solution from the 2017 Zhihu Machine Learning Challenge (963 teams). It classifies Chinese questions into topics using a battery of neural text models, then ensembles their predictions. The README is essentially a training manual: data preprocessing scripts, exact shell commands for each model variant, and a scoreboard showing what each architecture achieved.

The interesting bit

The winning insight isn’t architectural novelty; it’s systematic brute force. The authors trained separate word-level and character-level versions of every model, tried data augmentation for each, and ensembled the survivors, lifting the score from about 0.41 for a single model (LSTM_word_aug scores 0.41368) to 0.433 for the ensemble. They even include a del/ directory of failed methods, which is more honest than most competition write-ups.
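To make the ensembling step concrete, here is a minimal sketch of one common approach: a weighted average of per-topic scores across models, followed by picking the top-scoring topics. This summary does not say how the repo actually combines its models, so the averaging rule, the function names `ensemble_scores` and `top_k`, and the toy scores are all assumptions for illustration only.

```python
def ensemble_scores(model_scores, weights=None):
    """Blend per-topic scores from several models by weighted average.

    model_scores: list of score lists, one list per model, one score per topic.
    weights: optional per-model weights (hypothetical; defaults to equal).
    """
    if not model_scores:
        raise ValueError("need at least one model's scores")
    n_topics = len(model_scores[0])
    if any(len(scores) != n_topics for scores in model_scores):
        raise ValueError("all models must score the same number of topics")
    if weights is None:
        weights = [1.0] * len(model_scores)
    if len(weights) != len(model_scores):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    return [
        sum(w * scores[i] for w, scores in zip(weights, model_scores)) / total
        for i in range(n_topics)
    ]


def top_k(scores, k):
    """Return the indices of the k highest-scoring topics, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy scores for one question over four topics (made-up numbers).
cnn_word = [0.10, 0.70, 0.20, 0.05]
lstm_word = [0.20, 0.60, 0.10, 0.30]
rcnn_char = [0.05, 0.40, 0.50, 0.35]

blended = ensemble_scores([cnn_word, lstm_word, rcnn_char])
print(top_k(blended, 2))  # → [1, 2]
```

Averaging tends to help when the individual models make different mistakes, which is one plausible reason to mix word-level and character-level variants rather than ensemble near-identical models.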

Key highlights

  • Five model families: CNN, LSTM, RCNN, Inception-style CNN, and FastText
  • Both word and character embeddings, with augmentation toggles
  • Published score table: LSTM_word_aug hits 0.41368, ensemble reaches 0.433
  • Preprocessing requires >32GB RAM and uses tf.contrib.keras despite being a PyTorch project
  • Pretrained models hosted on Baidu Pan (password: tayb)
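The "every family, both embedding levels, augmentation on or off" sweep described above can be sketched as a simple grid. Only the name LSTM_word_aug appears in the published score table; the other names, the `variant_name` helper, and the idea that every combination was actually trained are assumptions made for illustration.

```python
from itertools import product

# Model families listed in the highlights; the short labels are hypothetical.
families = ["CNN", "LSTM", "RCNN", "Inception", "FastText"]
levels = ["word", "char"]   # word-level vs. character-level input
augment = [False, True]     # augmentation toggle


def variant_name(family, level, aug):
    """Build a name in the style of the score table, e.g. LSTM_word_aug."""
    return f"{family}_{level}" + ("_aug" if aug else "")


variants = [
    variant_name(family, level, aug)
    for family, level, aug in product(families, levels, augment)
]

print(len(variants))  # → 20
print(variants[:4])   # → ['CNN_word', 'CNN_word_aug', 'CNN_char', 'CNN_char_aug']
```

A full grid of this shape yields 20 candidate models; the write-up suggests only the ones that scored well were kept for the ensemble.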

Caveats

  • Python 2 and PyTorch 0.x era; setup instructions mention CUDA without specifying a version
  • Data paths are hardcoded (“modify the data path in the related file”)
  • Pretrained weights live on Baidu Pan with no mirror; reproducibility depends on Chinese cloud storage

Verdict

Worth studying if you’re building an ensemble pipeline or working on legacy Chinese NLP benchmarks. Skip it if you need modern PyTorch, clean abstractions, or a library you can pip install: this is glue code that happened to win.
