lyeoni/nlp-tutorial

NLP from scratch, with typos and all

A PyTorch tutorial collection that teaches by doing, not by polishing.

1.4k stars · Jupyter Notebook · Language Models · Learning · ML Frameworks
Velocity (7d): +0.5 ★ / day · Trend: steady

What it does

This repo bundles six hands-on NLP tutorials covering text classification, neural machine translation, and language modeling. Each sub-project is a self-contained notebook or script with “simple annotation” (the author’s phrase, not mine) that walks through implementations of CBoW, LSTM, Transformer, seq2seq, and TextCNN on real datasets like HuffPost news, IMDb reviews, and Wikipedia text.

The interesting bit

The author leaves the rough edges in: typos in folder names (“classifcation”), inconsistent frameworks (mostly PyTorch, one stray Keras entry), and variable-length sequences handled with no fanfare about padding. That messiness is arguably the pedagogy: you see how working code actually looks, not how a textbook says it should.
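
For readers unsure what the padding remark refers to: batching sequences of different lengths usually means padding them to a common length while keeping the true lengths around. Here is a minimal plain-Python sketch of that idea. It is not taken from the repo, `pad_batch` and `PAD_ID` are hypothetical names, and the notebooks may well handle this differently.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding index; assumes 0 is reserved in the vocabulary


def pad_batch(
    sequences: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad token-id sequences to the longest one in the batch.

    Returns the padded batch plus the original lengths, which a model
    can use to ignore the padded positions.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# Toy batch of three "sentences" already mapped to token ids.
batch = [[5, 8, 2], [7], [3, 9]]
padded, lengths = pad_batch(batch)
print(padded)
print(lengths)
```

The original lengths are returned alongside the padded batch because most models need them to mask or skip the filler positions.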

Key highlights

  • Covers both classic architectures (CBoW, LSTM) and newer pieces (the Transformer architecture, SentencePiece tokenization)
  • Includes a Korean-language sentiment analysis example using TextCNN — rare in English-dominated tutorial repos
  • Question-answer matching uses Stack Exchange data with TF-IDF and learned embeddings side by side
  • Neural language model tutorial lives in a separate repo, linked but not integrated
  • Each tutorial claims “simple” implementation; the repetition suggests a genuine design principle, not modesty
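
To make the TF-IDF half of the question-answer matching highlight concrete, here is a rough standard-library sketch of ranking candidate answers by TF-IDF cosine similarity. It is an illustration under assumptions, not the repo's code: the function names, the whitespace tokenizer, the smoothed IDF variant, and the toy strings are all made up for this example, and the learned-embedding side is not shown.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer; a real pipeline would do more.
    return text.lower().split()


def build_idf(docs: List[List[str]]) -> Dict[str, float]:
    # Smoothed inverse document frequency (one common variant).
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    # Sparse vector: raw term count times IDF; unseen terms are dropped.
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_answers(question: str, answers: List[str]) -> List[int]:
    """Return answer indices sorted from most to least similar."""
    docs = [tokenize(a) for a in answers]
    idf = build_idf(docs)
    q_vec = tfidf_vector(tokenize(question), idf)
    scores = [cosine(q_vec, tfidf_vector(d, idf)) for d in docs]
    return sorted(range(len(answers)), key=lambda i: scores[i], reverse=True)


# Toy data, invented for illustration only.
answers = [
    "use git rebase to rewrite commit history",
    "boil the pasta for ten minutes",
    "a python list comprehension builds a list",
]
ranking = rank_answers("how do i rewrite git history", answers)
print(answers[ranking[0]])
```

A learned-embedding matcher would swap `tfidf_vector` for a dense vector from a trained model and could keep the same cosine ranking, which is what makes a side-by-side comparison straightforward.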

Caveats

  • “Simple annotation” is accurate: comments are sparse, not explanatory
  • One tutorial uses Keras while the rest use PyTorch; the switch is unexplained
  • The language model tutorial is an external repo, so this isn’t a fully self-contained curriculum

Verdict

Good for developers who learn by reading working code and filling in gaps themselves. Skip it if you need narrative hand-holding or a unified framework throughout.
