makcedward/nlp

One developer's field notes from the NLP trenches

A curated learning journal that pairs runnable notebooks with Medium explainers, covering tokenization to T5.

1.1k stars · Python · Language Models · Learning · nlp
Velocity (7d): +0.4 ★/day · Trend: steady

What it does

This repo is essentially a well-organized study guide: Python notebooks and datasets walking through NLP fundamentals and a who’s-who of embedding and transformer models from 2018–2019. The author groups everything into practical buckets—text preprocessing, representation, data augmentation, and general ML tricks—each linking to a Medium article and usually a runnable notebook.

The interesting bit

The value isn’t novelty; it’s curation at scale. The author tracked the entire arc from word2vec through BERT, GPT-2, XLNet, and T5 as they dropped, often adding domain-specific variants (clinical BERT, scientific BERT) that mainstream tutorials skipped. Think of it as a time capsule of NLP’s transformer boom, maintained by someone actually reading the papers.

Key highlights

  • Covers the full pipeline: tokenization, lemmatization, spell-checking (Norvig and SymSpell), string matching, and stop-word removal
  • Character-level, word-level, and sentence-level embeddings each get their own section with paper links and reference implementations
  • Data augmentation gets unusual depth: back-translation, adversarial attacks, audio/speech augmentation, and unsupervised methods
  • Domain-specific BERT variants (clinical, scientific) included alongside mainstream models
  • Most sections pair a Medium explainer with a GitHub notebook—good for reading, then running
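
To give a feel for the preprocessing end of that list, here is a minimal sketch of tokenization, Norvig-style spell correction, and stop-word removal chained together. It is not code from the repo: the toy corpus, the stop-word list, and the function names are all assumptions made for illustration, and the corrector only looks one edit away, which is a simplification of Norvig's approach.

```python
import re
from collections import Counter

# Toy corpus and stop-word list: illustrative assumptions, not taken from the repo.
CORPUS = "the quick brown fox jumps over the lazy dog the fox sleeps"
STOP_WORDS = {"the", "over", "a", "an"}
WORD_COUNTS = Counter(CORPUS.split())
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def edits1(word):
    """Return every string one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Keep known words; otherwise pick the most frequent known word one edit away."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word


def preprocess(text):
    """Tokenize, spell-correct, then drop stop words."""
    return [t for t in map(correct, tokenize(text)) if t not in STOP_WORDS]


print(preprocess("The quikc brwn fox jumsp over a dog"))
```

A real pipeline would build the word counts from a large corpus and would likely add lemmatization; the point here is only the order of the steps, with correction running before stop-word filtering so that misspelled stop words are also caught.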

Caveats

  • README stops mid-word at “MultiFiT” and several paper links have typos (“Googles” for Google, duplicate arXiv IDs)
  • Coverage peters out around 2019; nothing on LLaMA, ChatGPT, or the instruction-tuning era that followed
  • Some notebook links point to the author’s other repo, nlpaug, rather than to code in this one

Verdict

Great if you’re trying to understand how we got here—the progression from bag-of-words to the transformer explosion. Skip it if you need production-ready libraries or state-of-the-art 2024 techniques; this is a learning journal, not a framework.
