NLP recipes that skip the theory homework
A collection of runnable notebooks for the text-processing tasks you actually need to do at work.

What it does
This repo is a curated set of Jupyter notebooks and Python scripts covering bread-and-butter NLP tasks: TF-IDF keyword extraction, text classification with logistic regression, Word2Vec training with Gensim, preprocessing pipelines, and even a PySpark word count for when your data stops fitting in RAM. Each entry links to a companion blog post by Kavita Ganesan.
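For a feel of what TF-IDF keyword extraction boils down to, here is a minimal pure-Python sketch. It is not taken from the repo: the `tokenize` and `tfidf_keywords` helpers and the toy corpus are made up for illustration, and the smoothed IDF formula is the one scikit-learn documents as its default, used here as an assumption about what a typical notebook would compute.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the terms of docs[doc_index] by TF-IDF against the whole corpus."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tf = Counter(tokenized[doc_index])
    scores = {
        # Smoothed IDF (assumed): ln((1 + n) / (1 + df)) + 1
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for a stable order
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical toy corpus
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the parser tokenizes text and the parser tags text",
]
keywords = tfidf_keywords(docs, doc_index=2, top_k=2)
print(keywords)
```

Note how "the" appears twice in the third document yet loses to "parser" and "text": it occurs in every document, so its IDF weight bottoms out.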
The interesting bit
The value isn’t novelty; it’s curation. The notebooks explicitly compare easily confused pairs (TfidfTransformer vs. TfidfVectorizer, HashingVectorizer vs. CountVectorizer, CBOW vs. Skip-gram) that most tutorials gloss over. Think of it as a field guide to sklearn’s vectorizer zoo.
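To make the HashingVectorizer vs. CountVectorizer distinction concrete, the sketch below reimplements the two underlying ideas in plain Python. It is an illustration under assumptions, not sklearn's code: `count_vectorize` and `hash_vectorize` are hypothetical helpers, whitespace splitting stands in for real tokenization, and `zlib.crc32` stands in for the hash function (sklearn documents MurmurHash3 for its hashing vectorizer).

```python
import zlib
from collections import Counter


def count_vectorize(docs):
    """Vocabulary-based counts: learn a term -> column map, then count."""
    vocab = {
        term: i
        for i, term in enumerate(sorted({t for d in docs for t in d.split()}))
    }
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for term, n in Counter(d.split()).items():
            row[vocab[term]] = n
        rows.append(row)
    return vocab, rows


def hash_vectorize(docs, n_features=8):
    """Hashing trick: column = hash(term) mod n_features, no vocabulary kept."""
    rows = []
    for d in docs:
        row = [0] * n_features
        for term in d.split():
            # crc32 is a stand-in hash (assumption); it is stable across runs
            row[zlib.crc32(term.encode("utf-8")) % n_features] += 1
        rows.append(row)
    return rows


docs = ["red fish blue fish", "one fish two fish"]
vocab, counts = count_vectorize(docs)
hashed = hash_vectorize(docs)

print(vocab)   # term -> column mapping that must be stored and reused
print(counts)  # one column per known term
print(hashed)  # fixed width; colliding terms share a column
```

The trade-off is the one the notebooks care about: the vocabulary approach can map a column back to a word but has to fit on the data first, while the hashing approach needs no fitting and has bounded memory at the cost of collisions and lost interpretability.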
Key highlights
- Runnable notebooks with datasets included where noted (word2vec, tf-idf, text classification)
- Pre-trained embedding loading via Gensim (GloVe and Word2Vec) with a text-similarity example
- PySpark phrase extraction and word count for larger-scale text
- Preprocessing snippets covering stemming, lemmatization, noise removal, and stop-word removal
- Each technique paired with an explanatory article, not just docstring regurgitation
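As a rough picture of what the preprocessing highlights involve, here is a small standard-library sketch of noise removal, stop-word removal, and a toy stemmer. Everything in it is an assumption for illustration: the stop-word list is a tiny made-up sample, `crude_stem` is a naive suffix stripper rather than a real stemmer such as Porter's, and lemmatization is left out because it needs a dictionary-backed library.

```python
import re

# Tiny illustrative stop-word list (hypothetical, far from complete)
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in"}


def remove_noise(text):
    """Strip HTML-like tags, then digits and punctuation, then lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    return text.lower()


def crude_stem(token):
    """Toy suffix stripper; keeps at least three characters of the word."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Noise removal -> whitespace tokenization -> stop words -> stemming."""
    tokens = remove_noise(text).split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("<p>The parsers are tokenizing 3 documents!</p>")
print(tokens)
```

The stem "tokeniz" is not a word, which is the usual argument for reaching for lemmatization when the output has to be readable.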
Caveats
- Some entries are external repos (phrase-at-scale, word_cloud) rather than in-tree code
- The “more articles” and mailing-list links suggest this doubles as content marketing; the code appears genuine, but the funnel is visible
Verdict
Worth bookmarking if you’re the “just show me a working example” type, especially for sklearn vectorizer gotchas. Skip it if you need deep learning (transformers, etc.) or production-grade pipelines: this is strictly classical NLP territory.