JohnSnowLabs/spark-nlp-workshop

1,088 stars, zero hype: a Spark NLP cookbook that just works

A sprawling repo of runnable notebooks for the Spark NLP ecosystem, from annotation to training to Databricks.

1.1k stars · Jupyter Notebook · ML Frameworks, Language Models, Learning
Velocity (7d): +0.4 ★ / day · Trend: steady

What it does

This is the official companion kitchen for John Snow Labs’ Spark NLP library: Jupyter notebooks, Colab-ready tutorials, and Databricks notebooks covering annotation pipelines, model training, and evaluation. The setup instructions are refreshingly explicit: Java 8, PySpark 3.1.2, pip install, done. There’s even a one-line shell script for Colab that downloads and configures everything behind the scenes.
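If you want to confirm your machine matches those pins before opening a notebook, a small preflight check is enough. The sketch below is not from the repo: the helper names (`java_major`, `installed_version`, `preflight`) are made up for illustration, and the only facts it takes from the README are Java 8 and PySpark 3.1.2. It assumes Python 3.8+ and that `java` is on your PATH.

```python
import re
import subprocess
from importlib import metadata

# Pins taken from the setup instructions described above.
EXPECTED_JAVA_MAJOR = 8
PINNED_PYSPARK = "3.1.2"


def java_major(version_output):
    """Return the Java major version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x, e.g. "1.8.0_292".
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_version(package):
    """Return the installed version of a pip package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def preflight():
    """Return a list of human-readable mismatches (empty means all good)."""
    problems = []
    try:
        # `java -version` prints to stderr, so both streams are checked.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
        major = java_major(result.stderr + result.stdout)
    except FileNotFoundError:
        major = None
    if major != EXPECTED_JAVA_MAJOR:
        problems.append(f"expected Java {EXPECTED_JAVA_MAJOR}, found {major}")

    pyspark = installed_version("pyspark")
    if pyspark != PINNED_PYSPARK:
        problems.append(f"expected pyspark {PINNED_PYSPARK}, found {pyspark}")
    return problems
```

Calling `preflight()` returns an empty list when both pins match, and a short description of each mismatch otherwise. Whether newer Java or PySpark versions also work is a question for the repo's README, not for this sketch.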

The interesting bit

The “old_generation_notebooks” folder under tutorials suggests this repo has been through enough iterations to accumulate historical baggage, yet someone is still maintaining backward compatibility. That’s either admirable diligence or a warning about API churn; the README doesn’t clarify which.

Key highlights

  • Python and Scala examples side by side (rare in notebook-land)
  • Dedicated Databricks folder for enterprise Spark deployments
  • One-shot Colab setup via wget | bash: convenient, if you trust it
  • Explicit dependency pinning (PySpark 3.1.2) rather than “latest and hope”
  • Apache 2.0 licensed, with a Slack community linked for support
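On the "if you trust it" point: one common alternative to piping a download straight into a shell is to save the script first, read it, and only then run it. The sketch below shows that pattern with the standard library. It is an illustration, not the repo's workflow; `download_for_review`, `fingerprint`, and the example URL are hypothetical, and the real script location should come from the repo's README.

```python
import hashlib
import urllib.request
from pathlib import Path


def fingerprint(data):
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def download_for_review(url, destination):
    """Save a remote script to disk WITHOUT executing it.

    Returns the SHA-256 digest of the downloaded bytes so the same
    content can be recognised on a later run.
    """
    with urllib.request.urlopen(url) as response:
        script = response.read()
    Path(destination).write_bytes(script)
    return fingerprint(script)


# Hypothetical usage; replace the placeholder URL with the one from the README:
# digest = download_for_review("https://example.com/colab_setup.sh", "colab_setup.sh")
# print(digest)  # then open colab_setup.sh, read it, and run it yourself
```

The extra step costs about a minute and leaves a local copy you can diff if the upstream script changes.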

Caveats

  • The “evalulation” typo in the table of contents has survived at least one README revision
  • No topics tagged on GitHub, making discovery harder than it should be
  • “Old generation” notebooks are still prominently linked; unclear if they’re deprecated or merely archived

Verdict

Grab this if you’re already committed to Spark NLP and need working starter code. Skip it if you’re looking for a general NLP tutorial: the Spark dependency and JVM tooling make this a niche affair.
