practicingman/chinese_text_cnn

Kim's 2014 CNN, still kicking in Chinese

A faithful PyTorch port of the classic TextCNN paper, wired for Chinese sentiment analysis with jieba and Zhihu word vectors.

646 stars · Python · Language Models · +0.2 ★/day over the last 7 days · trend: steady

What it does

Implements the four embedding variants from Yoon Kim’s 2014 sentence-classification paper (random, static, non-static, and multichannel), then trains them on a Chinese text corpus for sentiment classification. Tokenization is handled by jieba; word vectors come from a word2vec model trained on Zhihu QA data, distributed by the Chinese-Word-Vectors project.
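
To make the four variants concrete, here is a minimal sketch of which embedding tables each one feeds to the convolution layers, following the definitions in Kim's paper. It is not the repository's code: the names `Variant`, `Channel`, and `embedding_channels` are hypothetical, and the sketch uses only the standard library so it runs anywhere.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Variant(Enum):
    RAND = "rand"
    STATIC = "static"
    NON_STATIC = "non-static"
    MULTICHANNEL = "multichannel"


@dataclass(frozen=True)
class Channel:
    pretrained: bool  # initialised from pretrained word vectors, not randomly
    trainable: bool   # updated by backpropagation during training


def embedding_channels(variant: Variant) -> List[Channel]:
    """Return the embedding tables a given Kim (2014) variant uses as input."""
    if variant is Variant.RAND:
        # Randomly initialised vectors, learned from scratch.
        return [Channel(pretrained=False, trainable=True)]
    if variant is Variant.STATIC:
        # Pretrained vectors kept frozen.
        return [Channel(pretrained=True, trainable=False)]
    if variant is Variant.NON_STATIC:
        # Pretrained vectors fine-tuned on the task.
        return [Channel(pretrained=True, trainable=True)]
    # Multichannel: two copies of the pretrained vectors, one frozen and
    # one fine-tuned, both passed through the same convolution filters.
    return [
        Channel(pretrained=True, trainable=False),
        Channel(pretrained=True, trainable=True),
    ]


for v in Variant:
    print(v.value, embedding_channels(v))
```

In PyTorch terms, a frozen versus fine-tuned pretrained channel corresponds to `nn.Embedding.from_pretrained(vectors, freeze=True)` versus `freeze=False`. How the repository itself wires this up is not shown in its README.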

The interesting bit

The README is essentially a lab notebook: every variant has a concrete accuracy number, and the progression is clean. Random initialization hits 94%, frozen pretrained vectors jump to 95%, and fine-tuning the embeddings nudges it to 96%. The multichannel trick (static + fine-tuned side by side) matches fine-tuning alone, which is itself a useful data point.

Key highlights

  • Four Kim CNN variants in one script, toggled by CLI flags (-static, -non-static, -multichannel)
  • Pretrained Chinese word vectors from a real social-QA corpus, not generic news text
  • Early stopping with a 1000-step patience baked in
  • Dependencies pinned to PyTorch 1.0.0 and torchtext 0.3.1: archaeologically precise
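
For readers unfamiliar with step-based patience, the early-stopping highlight above can be sketched roughly as follows. This is a generic illustration, not the repository's implementation: the `EarlyStopper` class, the evaluation interval, and the accuracy values in `history` are all hypothetical. Only the 1000-step patience comes from the README.

```python
class EarlyStopper:
    """Stop once the best dev accuracy has not improved for `patience` steps."""

    def __init__(self, patience: int = 1000):
        self.patience = patience
        self.best_acc = float("-inf")
        self.best_step = 0

    def update(self, step: int, dev_acc: float) -> bool:
        """Record one evaluation result; return True if training should stop."""
        if dev_acc > self.best_acc:
            # New best: remember it and keep training.
            self.best_acc = dev_acc
            self.best_step = step
            return False
        # No improvement: stop once the gap since the best step reaches patience.
        return step - self.best_step >= self.patience


stopper = EarlyStopper(patience=1000)

# Hypothetical evaluation log: (training step, dev accuracy).
history = [(100, 0.90), (200, 0.94), (700, 0.93), (1200, 0.94), (1300, 0.93)]

stopped_at = None
for step, acc in history:
    if stopper.update(step, acc):
        stopped_at = step
        break

print(stopped_at, stopper.best_step, stopper.best_acc)
```

Here the best accuracy is reached at step 200 and only tied afterwards, so the loop stops at step 1200, exactly 1000 steps later. A tie does not reset the counter in this sketch; treating ties as improvements would be an equally reasonable design choice.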

Caveats

  • No mention of what the 7,000-sample evaluation dataset actually is
  • PyTorch 1.0.0 and torchtext 0.3.1 are years out of date; expect some dependency archaeology before it runs today
  • No code structure or module breakdown shown in the README

Verdict

Worth a look if you need a minimal, working TextCNN baseline for Chinese text and don’t mind updating the dependency stack. Skip it if you want modern transformers, production-grade logging, or any explanation of the training data’s provenance.
