How to win a Chinese NLP competition: throw every model at it
A 2017 competition-winning repo that ensembles CNNs, LSTMs, RCNNs, and even FastText to classify Zhihu questions.

What it does
This is the first-place solution from the 2017 Zhihu Machine Learning Challenge (963 teams). It classifies Chinese questions into topics using a battery of neural text models, then ensembles their predictions. The README is essentially a training manual: data preprocessing scripts, exact shell commands for each model variant, and a scoreboard showing what each architecture achieved.
The interesting bit
The winning insight isn’t architectural novelty; it’s systematic brute force. The authors trained separate word-level and character-level versions of every model, tried data augmentation for each, and ensembled the survivors to push the score from ~0.41 for a single model to 0.433. They even include a del/ directory of failed methods, which is more honest than most competition write-ups.
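To make the ensembling step concrete, here is a minimal sketch of one common way to combine per-model predictions: a weighted average of score matrices followed by a top-k pick. This summary does not say how the repo actually blends its models, so the averaging scheme, the function names `ensemble_scores` and `top_k_topics`, and the toy numbers are all assumptions for illustration, not the authors' code.

```python
from typing import Dict, List, Optional

Scores = List[List[float]]  # one row per question, one column per topic


def ensemble_scores(model_scores: Dict[str, Scores],
                    weights: Optional[Dict[str, float]] = None) -> Scores:
    """Weighted average of per-model score matrices (all the same shape)."""
    names = sorted(model_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}  # plain average by default
    total = sum(weights[name] for name in names)
    n_rows = len(model_scores[names[0]])
    n_cols = len(model_scores[names[0]][0])
    blended = [[0.0] * n_cols for _ in range(n_rows)]
    for name in names:
        w = weights[name] / total  # normalise so the weights sum to 1
        for i, row in enumerate(model_scores[name]):
            for j, score in enumerate(row):
                blended[i][j] += w * score
    return blended


def top_k_topics(scores: Scores, k: int) -> List[List[int]]:
    """Indices of the k highest-scoring topics for each question."""
    return [sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
            for row in scores]


# Toy scores for 2 questions over 4 topics. The model names echo the
# naming style of the score table, but the numbers are invented.
toy = {
    "LSTM_word": [[0.1, 0.6, 0.2, 0.1], [0.5, 0.1, 0.3, 0.1]],
    "CNN_char":  [[0.2, 0.3, 0.4, 0.1], [0.4, 0.2, 0.3, 0.1]],
}
blended = ensemble_scores(toy)
print(top_k_topics(blended, k=2))
```

Averaging tends to help most when the models make different mistakes, which is one plausible reason for training both word-level and character-level variants of each architecture.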
Key highlights
- Five model families: CNN, LSTM, RCNN, Inception-style CNN, and FastText
- Both word and character embeddings, with augmentation toggles
- Published score table: LSTM_word_aug hits 0.41368, ensemble reaches 0.433
- Preprocessing requires >32GB RAM and uses tf.contrib.keras despite being a PyTorch project
- Pretrained models hosted on Baidu Pan (password: tayb)
Caveats
- Python 2 and PyTorch 0.x era; the setup instructions mention CUDA without specifying a version
- Data paths are hardcoded (“modify the data path in the related file”)
- Pretrained weights live on Baidu Pan with no mirror; reproducibility depends on Chinese cloud storage
Verdict
Worth studying if you’re building an ensemble pipeline or working on legacy Chinese NLP benchmarks. Skip it if you need modern PyTorch, clean abstractions, or a library you can pip install: this is glue code that happened to win.