ogrisel/parallel_ml_tutorial

PyCon 2013 tutorial: parallel ML before it was mainstream

A time-capsule notebook collection teaching scikit-learn parallelism via IPython, back when IPython 2.2.0 was current.

1.6k stars · Jupyter Notebook · Learning · ML Frameworks
Velocity · 7d: +0.3 ★ / day · Trend: steady

What it does

A set of executable IPython notebooks from a 2013 PyCon tutorial by Olivier Grisel, covering how to parallelize scikit-learn workflows across cores and cheap EC2 spot instances. Topics span cross-validation, grid search, text feature extraction, memory-mapped numpy arrays, and spinning up clusters with the since-deprecated StarCluster tool.

The interesting bit

The tutorial captures a specific evolutionary moment: scikit-learn’s Estimator API was still being learned, IPython (not yet Jupyter) was the interactive frontier, and “cheap parallel compute” meant wrangling StarCluster on Amazon spot instances. The material is adapted from a SciPy 2013 tutorial by Gael Varoquaux and Jake VanderPlas.

Key highlights

  • Static rendered notebooks viewable without installation via nbviewer.ipython.org
  • Covers numpy memory mapping for node-level memory optimization
  • Includes fetch_data.py to pull datasets before running interactively
  • Explicitly targets developers already comfortable with scikit-learn basics
  • A video recording is available; the notebook titles serve as section markers for following along
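
To make the memory-mapping highlight concrete, here is a minimal sketch of the general technique using `np.memmap`. It is not taken from the tutorial notebooks: the file name `features.dat` and the helpers `write_shared_array` and `column_means` are hypothetical names chosen for illustration.

```python
import os
import tempfile

import numpy as np


def write_shared_array(path, data):
    """Write `data` to a raw file on disk through a writable memory map."""
    mm = np.memmap(path, dtype=data.dtype, mode="w+", shape=data.shape)
    mm[:] = data
    mm.flush()  # make sure the bytes reach the file
    return data.shape, data.dtype


def column_means(path, shape, dtype):
    """Re-open the file read-only; the OS loads pages lazily on access."""
    mm = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    return np.asarray(mm.mean(axis=0))


# Hypothetical scratch location for the example.
path = os.path.join(tempfile.mkdtemp(), "features.dat")
data = np.arange(12, dtype=np.float64).reshape(4, 3)

shape, dtype = write_shared_array(path, data)
means = column_means(path, shape, dtype)
print(means)
```

Because worker processes on the same machine can each open the file read-only, the operating system can back them all with a single copy in its page cache instead of duplicating the array per process.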

Caveats

  • Setup instructions reference IPython 2.2.0 and scikit-learn 0.15.2; the notebooks will need porting to run in a modern environment
  • StarCluster dependency for EC2 clustering is unmaintained (last release 2013)
  • No updates since original publication; some APIs have evolved significantly

Verdict

Worth browsing for historical context on how the scikit-learn ecosystem taught parallelism, or if you’re maintaining legacy IPython-based workflows. Skip it if you need current best practices: Dask, Ray, or scikit-learn’s own n_jobs patterns have superseded much of this.
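
For comparison, the n_jobs pattern mentioned in the verdict looks roughly like the sketch below in a current scikit-learn release. It is an illustration rather than code from the tutorial: the synthetic dataset, the estimator, and the parameter grid are assumptions picked to keep the example small. Note that `GridSearchCV` is imported from `sklearn.model_selection` here, whereas releases from the tutorial's era exposed it under `sklearn.grid_search`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (an assumption for illustration only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid: three regularization strengths to compare.
param_grid = {"C": [0.1, 1.0, 10.0]}

# n_jobs=2 asks scikit-learn to evaluate candidate/fold combinations
# using two parallel workers on the local machine.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=3,
    n_jobs=2,
)
search.fit(X, y)

best_c = search.best_params_["C"]
print("best C:", best_c)
```

This covers only the single-machine case; spreading the same search across several nodes is where tools such as Dask or Ray come in.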
