databricks/spark-sklearn

Databricks killed its own Spark-sklearn bridge. Here's the replacement.

The official integration is deprecated; the README now points to joblib-spark as the successor for distributed scikit-learn tuning.

1.1k stars Python ML Frameworks

What it does

This was Databricks’ official package for running scikit-learn’s GridSearchCV and similar tools across a Spark cluster. It wrapped scikit-learn’s parallel hyperparameter search so that multiple model fits could execute on Spark executors instead of a single machine’s cores. It also converted Spark DataFrames to NumPy arrays or sparse matrices.

The interesting bit

The project is now a historical artifact. The README opens with a deprecation notice and immediately redirects users to joblib-spark, a backend that plugs into the joblib parallel_backend mechanism scikit-learn already uses, rather than requiring a separate GridSearchCV wrapper. The old package required you to pass a SparkContext explicitly (GridSearchCV(sc, ...)); the new approach uses a context manager.
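
A minimal sketch of the context-manager style, assuming joblibspark's register_spark() and joblib's parallel_backend as the entry points. The pick_backend and tune helpers, the iris dataset, and the parameter grid are illustrative names and values, not taken from the README. Third-party imports are deferred into the functions so the module loads even where Spark is not installed.

```python
def pick_backend() -> str:
    """Return "spark" when joblib-spark is usable, else joblib's local "loky"."""
    try:
        from joblibspark import register_spark  # pip install joblibspark
        register_spark()  # registers the "spark" backend with joblib
    except ImportError:
        # Assumption for this sketch: fall back to local worker processes.
        return "loky"
    return "spark"


def tune(n_jobs: int = 3):
    """Grid-search an SVC on iris; the model fits run on the chosen backend."""
    from joblib import parallel_backend
    from sklearn import datasets, svm
    from sklearn.model_selection import GridSearchCV

    X, y = datasets.load_iris(return_X_y=True)
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # illustrative
    # Plain scikit-learn GridSearchCV: no SparkContext argument, unlike the
    # legacy spark_sklearn form GridSearchCV(sc, ...).
    search = GridSearchCV(svm.SVC(), param_grid, cv=3)

    # The backend is selected by the context manager, not by the estimator.
    with parallel_backend(pick_backend(), n_jobs=n_jobs):
        search.fit(X, y)
    return search
```

Calling tune() and reading search.best_params_ works the same way on either backend. Only the name passed to parallel_backend changes, which is the practical difference from the wrapper-based design.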

Key highlights

  • Deprecated by its own maintainers; no new development
  • Original scope: small data, embarrassingly parallel search (not distributed algorithms — Spark MLlib handles big data)
  • Replacement: joblib-spark via pip install joblibspark
  • Requirements for replacement: pyspark>=2.4.4, scikit-learn>=0.21
  • Original package supported scikit-learn 0.18–0.19 and Spark >= 2.1.1
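
For anyone auditing a legacy environment, the version floors listed above can be checked with the standard library alone. This is a rough sketch: the MINIMUMS table restates the replacement's requirements from the list, while as_tuple and check are hypothetical helpers that compare only the leading numeric parts of a version string.

```python
import re
from importlib import metadata

# Floors for the joblib-spark route, as stated in the highlights above.
MINIMUMS = {"pyspark": (2, 4, 4), "scikit-learn": (0, 21)}


def as_tuple(version: str) -> tuple:
    """Leading numeric components of a version string as a tuple of ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break  # stop at suffixes such as "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def check(minimums=MINIMUMS) -> dict:
    """Map each package name to "ok", "missing", or "too old (<version>)"."""
    report = {}
    for name, floor in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = as_tuple(installed) >= floor
        report[name] = "ok" if ok else f"too old ({installed})"
    return report
```

Running check() in the target environment gives a quick read on whether the migration can proceed before touching any imports.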

Caveats

  • Tests were already incompatible with scikit-learn 0.20 before deprecation
  • The “seamless” DataFrame conversion was advertised but described as a basic utility, not a primary feature

Verdict

Worth reading only if you’re maintaining legacy code that imports spark_sklearn. For new projects, follow the README’s own advice and use joblib-spark directly. If you need distributed learning on large datasets, Spark MLlib was always the intended path.
