shuaihuaiyi/QA

Undergrad's LSTM QA project: honest, deprecated, oddly refreshing

A Chinese question-answering system whose author tells you not to use it.

556 stars · Python · Language Models · +0.2 ★/day over the last 7 days (trend: steady)

What it does

This repo implements a sentence-level answer retrieval system for Chinese text: given a question and multiple candidate sentences, a bidirectional LSTM identifies which sentence contains the answer. It uses jieba for segmentation and pre-trained 50-dimensional word embeddings from Chinese Wikipedia. The author reports MRR above 0.75 on a held-out dev set.
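
The task framing above (one question, several candidate sentences, pick the one most likely to contain the answer) can be sketched as a generic ranking loop. This is only an illustration, not the repo's code: `rank_candidates` and `char_overlap` are hypothetical names, and the character-overlap scorer is a deliberately trivial stand-in for the BiLSTM, which in the real project would consume jieba tokens and word embeddings.

```python
from typing import Callable, List, Tuple


def rank_candidates(
    question: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[float, str]]:
    """Score every candidate against the question and sort best-first."""
    scored = [(score_fn(question, sentence), sentence) for sentence in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def char_overlap(question: str, sentence: str) -> float:
    """Placeholder scorer (assumption): Jaccard overlap of character sets.

    A trained model such as a BiLSTM would replace this function.
    """
    q_chars, s_chars = set(question), set(sentence)
    union = q_chars | s_chars
    return len(q_chars & s_chars) / len(union) if union else 0.0


question = "中国的首都是哪里"
candidates = ["苹果是一种水果", "北京是中国的首都"]
ranked = rank_candidates(question, candidates, char_overlap)
best_score, best_sentence = ranked[0]
```

Swapping `char_overlap` for a learned scorer leaves the ranking loop unchanged, which is why the evaluation metrics for this kind of system are all rank-based.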

The interesting bit

The README opens with “This project is no longer maintained!!!” and calls the code “basically all written haphazardly,” a level of self-awareness rare in academic GitHub repos. The author admits the model was chosen for convenience, the API usage was clumsy, and the hyperparameter tuning was “very rough.” That candor is more useful than most polished-but-unreproducible NLP papers.

Key highlights

  • BiLSTM architecture for sentence ranking in Chinese QA
  • Uses jieba + 50-dim Wikipedia-trained word embeddings
  • Evaluation via MRR, MAP, and ACC@1 (script credited to a teaching assistant)
  • TensorFlow 1.2.1, Python 3.5.2 — firmly archaeological stack
  • Training: ~8GB RAM, 2GB VRAM, 12 hours on a GTX 850M
  • Results vary ±0.03 MRR across runs with identical parameters; cause unknown
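
The three metrics listed above have standard definitions, sketched below in plain Python. This is not the teaching assistant's script: the function names and the input layout (one `(scores, labels)` pair per question) are assumptions, as is the choice to count a question with no correct candidate as 0 for every metric.

```python
from typing import Dict, List, Sequence, Tuple


def ranked_labels(scores: Sequence[float], labels: Sequence[int]) -> List[int]:
    """Return the 0/1 labels reordered by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def reciprocal_rank(ranked: Sequence[int]) -> float:
    """1 / position of the first correct candidate, or 0 if there is none."""
    for position, label in enumerate(ranked, start=1):
        if label:
            return 1.0 / position
    return 0.0


def average_precision(ranked: Sequence[int]) -> float:
    """Mean of precision@k taken at each position k holding a correct candidate."""
    hits, total = 0, 0.0
    for position, label in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def evaluate(
    questions: Sequence[Tuple[Sequence[float], Sequence[int]]]
) -> Dict[str, float]:
    """Average MRR, MAP and ACC@1 over all questions."""
    if not questions:
        raise ValueError("need at least one question")
    rankings = [ranked_labels(scores, labels) for scores, labels in questions]
    n = len(rankings)
    return {
        "MRR": sum(reciprocal_rank(r) for r in rankings) / n,
        "MAP": sum(average_precision(r) for r in rankings) / n,
        "ACC@1": sum(1.0 for r in rankings if r and r[0]) / n,
    }


# Two toy questions: the first ranks its correct sentence at position 1,
# the second at position 3.
toy = [
    ([0.9, 0.2, 0.1], [1, 0, 0]),
    ([0.3, 0.8, 0.5], [1, 0, 0]),
]
metrics = evaluate(toy)
```

When each question has exactly one correct sentence, MRR and MAP coincide, so any gap between the two reflects questions with multiple correct candidates.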

Caveats

  • Explicitly abandoned by the author with no maintenance planned
  • Dataset cannot be shared due to licensing; you’ll need your own training.data and develop.data
  • “Not much reference value at either the code level or the academic level,” in the author’s own assessment
  • Hardware requirements and TF 1.x dependencies make reproduction a deliberate exercise in retrocomputing

Verdict

Worth a skim if you’re studying how not to structure a deep learning project, or if you need a baseline BiLSTM implementation you can freely criticize. Anyone seeking a production Chinese QA system should look elsewhere; the author would agree.
