One developer's field notes from the NLP trenches
A curated learning journal that pairs runnable notebooks with Medium explainers, covering tokenization to T5.

What it does
This repo is a well-organized study guide: Python notebooks and datasets that walk through NLP fundamentals and a who’s-who of embedding and transformer models, mostly from 2018–2019. The author groups everything into practical buckets (text preprocessing, representation, data augmentation, and general ML tricks), each linking to a Medium article and usually a runnable notebook.
The interesting bit
The value isn’t novelty; it’s curation at scale. The author tracked the entire arc from word2vec through BERT, GPT-2, XLNet, and T5 as they dropped, often adding domain-specific variants (clinical BERT, scientific BERT) that mainstream tutorials skipped. Think of it as a time capsule of NLP’s transformer boom, maintained by someone actually reading the papers.
Key highlights
- Covers the full pipeline: tokenization, lemmatization, spell-checking (Norvig and Symspell), string matching, and stop-word removal
- Character-level, word-level, and sentence-level embeddings each get their own section with paper links and reference implementations
- Data augmentation gets unusual depth: back-translation, adversarial attacks, audio/speech augmentation, and unsupervised methods
- Domain-specific BERT variants (clinical, scientific) included alongside mainstream models
- Most sections pair a Medium explainer with a GitHub notebook—good for reading, then running
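To give a feel for the preprocessing material in the first bullet, here is a minimal sketch of tokenization, stop-word removal, and Norvig-style spelling correction. It is not taken from the repo's notebooks: the function names, the tiny stop-word list, and the toy word counts are all assumptions made for illustration, and the corrector only looks one edit away.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def edits1(word):
    """Return every string one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, word_counts):
    """Keep known words; otherwise pick the most frequent known word one edit away."""
    if word in word_counts:
        return word
    candidates = sorted(w for w in edits1(word) if w in word_counts)
    if not candidates:
        return word  # nothing close enough, leave the token unchanged
    return max(candidates, key=lambda w: word_counts[w])


# Toy frequency table (hypothetical numbers, standing in for corpus counts).
WORD_COUNTS = {"spelling": 10, "spewing": 1, "checker": 4, "fixes": 3}

tokens = remove_stop_words(tokenize("The speling checker fixes the text"))
corrected = [correct(t, WORD_COUNTS) for t in tokens]
```

In this sketch, `speling` is one edit away from both `spelling` and `spewing`, and the higher count decides between them. Norvig's original corrector also searches two edits out and draws its counts from a large corpus.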
Caveats
- README stops mid-word at “MultiFiT” and several paper links have typos (“Googles” for Google, duplicate arXiv IDs)
- Coverage peters out around 2019; there is no LLaMA, no ChatGPT, and nothing from the modern instruction-tuning era
- Some notebook links point to the author’s other repo, nlpaug, rather than local code
Verdict
Great if you’re trying to understand how we got here—the progression from bag-of-words to the transformer explosion. Skip it if you need production-ready libraries or state-of-the-art 2024 techniques; this is a learning journal, not a framework.