kavgan/nlp-in-practice

NLP recipes that skip the theory homework

A collection of runnable notebooks for the text-processing tasks you actually need to do at work.

1.2k stars · Jupyter Notebook · Learning · Data Tooling
Velocity (7d): +0.4 ★/day · Trend: steady

What it does

This repo is a curated set of Jupyter notebooks and Python scripts covering bread-and-butter NLP tasks: TF-IDF keyword extraction, text classification with logistic regression, Word2Vec training with Gensim, preprocessing pipelines, and a PySpark word count for when your data stops fitting in RAM. Each entry links to a companion blog post by Kavita Ganesan.
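To give a flavor of the text-classification recipe, here is a minimal sketch of TF-IDF features feeding a logistic regression in scikit-learn. It is not taken from the repo's notebooks: the toy sentences, the labels, and the names `texts`, `labels`, `clf`, and `pred` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical, only here to make the sketch runnable).
texts = [
    "the team won the match with a late goal",
    "the striker scored twice in the final",
    "the new phone ships with a faster chip",
    "the laptop update improves battery life",
]
labels = ["sports", "sports", "tech", "tech"]

# Chain the vectorizer and the classifier so raw strings go in
# and predicted labels come out.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Predict the label of an unseen sentence.
pred = clf.predict(["the goalkeeper saved the match"])
print(pred[0])
```

Wrapping both steps in one pipeline keeps the vocabulary learned at training time attached to the classifier, so new text is vectorized the same way at prediction time.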

The interesting bit

The value isn't novelty; it's curation. The notebooks explicitly compare easily confused pairs (TfidfTransformer vs. TfidfVectorizer, HashingVectorizer vs. CountVectorizer, CBOW vs. skip-gram) that most tutorials gloss over. Think of it as a field guide to sklearn's vectorizer zoo.
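For the first of those pairs, the difference is easiest to see side by side. The sketch below is our own illustration, not code from the repo, and assumes scikit-learn's default settings: `TfidfTransformer` reweights an existing count matrix, while `TfidfVectorizer` goes from raw text to TF-IDF in one object. The sample `docs` are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Invented sample documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Route 1: count terms first, then reweight the counts with TF-IDF.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: TfidfVectorizer performs both steps in a single object.
one_step = TfidfVectorizer().fit_transform(docs)

# With default settings the two routes produce the same matrix.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step route is the one to reach for when you already have a count matrix or need the raw counts for something else; otherwise the one-step class is less to wire up.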

Key highlights

  • Runnable notebooks with datasets included where noted (word2vec, tf-idf, text classification)
  • Pre-trained embedding loading via Gensim (GloVe and Word2Vec) with a text-similarity example
  • PySpark phrase extraction and word count for larger-scale text
  • Preprocessing snippets covering stemming, lemmatization, noise removal, and stop-word removal
  • Each technique paired with an explanatory article, not just docstring regurgitation
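As a rough idea of what the preprocessing snippets cover, here is a standard-library-only sketch of noise removal and stop-word removal. It is not the repo's code: the `preprocess` function and the tiny `STOP_WORDS` set are hypothetical, and stemming and lemmatization are left out because they need a library such as NLTK or spaCy.

```python
import re

# A deliberately tiny, hand-written stop list (an assumption for this
# sketch; real pipelines use a much longer list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "at"}


def preprocess(text):
    """Lowercase, strip noise, tokenize on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # noise removal: HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)  # noise removal: digits, punctuation
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("<p>The 3 cats are sitting at the door!</p>")
print(tokens)  # ['cats', 'sitting', 'door']
```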

Caveats

  • Some entries are external repos (phrase-at-scale, word_cloud) rather than in-tree code
  • The “more articles” and mailing-list links suggest this doubles as content marketing; the code appears genuine, but the funnel is visible

Verdict

Worth bookmarking if you're the "just show me a working example" type, especially for sklearn vectorizer gotchas. Skip it if you need deep learning (transformers, etc.) or production-grade pipelines: this is strictly classical NLP territory.
