1,088 stars, zero hype: a Spark NLP cookbook that just works
A sprawling repo of runnable notebooks for the Spark NLP ecosystem, from annotation to training to Databricks.

What it does
This is the official companion kitchen for John Snow Labs’ Spark NLP library: Jupyter notebooks, Colab-ready tutorials, and Databricks notebooks covering annotation pipelines, model training, and evaluation. The setup instructions are refreshingly explicit: Java 8, PySpark 3.1.2, pip install, done. There’s even a one-liner shell script for Colab that downloads and configures everything behind the scenes.
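Because the setup hinges on two pinned versions, a small preflight check can save a failed notebook run. The following is a minimal sketch, not something taken from the repo: the function names `parse_java_major` and `check_environment` are hypothetical, and the expected versions are simply the ones quoted above (Java 8, PySpark 3.1.2).

```python
import re
from typing import List, Optional

# Versions quoted in this review; assumed to match the repo's README.
EXPECTED_JAVA_MAJOR = 8
EXPECTED_PYSPARK = "3.1.2"


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output.

    Java 8 and earlier report a legacy string such as "1.8.0_292";
    Java 9 and later report strings such as "11.0.11" or "17".
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Legacy scheme: "1.8" means Java 8.
    if first == 1 and second is not None:
        return int(second)
    return first


def check_environment(java_output: str, pyspark_version: str) -> List[str]:
    """Return human-readable problems; an empty list means both pins match."""
    problems = []
    major = parse_java_major(java_output)
    if major != EXPECTED_JAVA_MAJOR:
        problems.append(f"expected Java {EXPECTED_JAVA_MAJOR}, found {major}")
    if pyspark_version != EXPECTED_PYSPARK:
        problems.append(
            f"expected PySpark {EXPECTED_PYSPARK}, found {pyspark_version}"
        )
    return problems
```

In a live session the inputs would typically come from `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr` (the JVM prints its version banner to stderr) and from `pyspark.__version__`.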
The interesting bit
The “old_generation_notebooks” folder in tutorials suggests this repo has been through enough iterations to accumulate historical baggage, yet someone is still maintaining backward compatibility. That’s either admirable diligence or a warning about API churn; the README doesn’t clarify which.
Key highlights
- Python and Scala examples side by side (rare in notebook-land)
- Dedicated Databricks folder for enterprise Spark deployments
- One-shot Colab setup via wget | bash (convenient, if you trust it)
- Explicit dependency pinning (PySpark 3.1.2) rather than “latest and hope”
- Apache 2.0 licensed, with a Slack community linked for support
Caveats
- The “evalulation” typo in the table of contents has survived at least one README revision
- No topics tagged on GitHub, making discovery harder than it should be
- “Old generation” notebooks are still prominently linked; unclear if they’re deprecated or merely archived
Verdict
Grab this if you’re already committed to Spark NLP and need working starter code. Skip it if you’re looking for a general NLP tutorial: the Spark dependency and JVM tooling make this a niche affair.