NLP from scratch, typos and all
A PyTorch tutorial collection that teaches by doing, not by polishing.

What it does: This repo bundles six hands-on NLP tutorials covering text classification, neural machine translation, and language modeling. Each sub-project is a self-contained notebook or script with “simple annotation” (the author’s phrase, not mine) that walks through an implementation of CBoW, LSTM, Transformer, seq2seq, or TextCNN on real datasets such as HuffPost news, IMDb reviews, and Wikipedia text.
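For readers who have not met CBoW before, the sketch below shows the kind of training pairs such a model consumes: each word is predicted from the words around it. This is a generic illustration, not code from the repo; the function name `cbow_pairs` and the `window` parameter are my own assumptions.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs for a CBoW-style model.

    Hypothetical helper for illustration only; not taken from the repo.
    """
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target form its context.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip targets that have no neighbours at all
            pairs.append((context, target))
    return pairs


pairs = cbow_pairs("the cat sat on the mat".split(), window=1)
for context, target in pairs:
    print(context, "->", target)
```

A real tutorial would then map these words to integer ids and average the context embeddings to predict the target; that step is left out here to keep the sketch dependency-free.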
The interesting bit: The author leaves the rough edges in. Folder names carry typos (“classifcation”), the frameworks are inconsistent (mostly PyTorch, one stray Keras entry), and variable-length sequences are handled without any fanfare around padding. That messiness is arguably the pedagogy: you see how working code actually looks, not how a textbook says it should.
Key highlights
- Covers both classic architectures (CBoW, LSTM) and newer pieces (the Transformer architecture, SentencePiece tokenization)
- Includes a Korean-language sentiment analysis example using TextCNN — rare in English-dominated tutorial repos
- Question-answer matching uses Stack Exchange data with TF-IDF and learned embeddings side by side
- Neural language model tutorial lives in a separate repo, linked but not integrated
- Each tutorial claims a “simple” implementation; the repetition suggests a genuine design principle, not modesty
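To make the TF-IDF half of the question-answer matching highlight concrete, here is a minimal sketch of ranking candidate answers by TF-IDF cosine similarity. It uses only the standard library and is not the repo's implementation; the function names, the whitespace tokenizer, and the smoothed IDF formula are all assumptions of mine.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed variant; other formulas are equally valid).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (cnt / len(toks)) * idf[t] for t, cnt in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_answer(question, answers):
    """Return (index of the highest-scoring answer, list of all scores)."""
    vecs = tfidf_vectors([question] + answers)
    q_vec, cand_vecs = vecs[0], vecs[1:]
    scores = [cosine(q_vec, c) for c in cand_vecs]
    best = max(range(len(answers)), key=scores.__getitem__)
    return best, scores


question = "how do I reverse a list in python"
answers = [
    "use reversed or slicing to reverse a list in python",
    "the capital of france is paris",
]
idx, scores = best_answer(question, answers)
print(idx, scores)
```

The learned-embedding approach swaps the sparse dicts for dense vectors produced by a trained model, but the ranking step (score every candidate, take the maximum) keeps the same shape.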
Caveats
- “Simple annotation” is accurate: comments are sparse, not explanatory
- One tutorial uses Keras while the rest use PyTorch; the switch is unexplained
- The language model tutorial is an external repo, so this isn’t a fully self-contained curriculum
Verdict: Good for developers who learn by reading working code and filling in gaps themselves. Skip it if you need narrative hand-holding or a unified framework throughout.