Databricks killed its own Spark-sklearn bridge. Here's the replacement.
The official integration is deprecated; the README now points to joblib-spark as the successor for distributed scikit-learn tuning.

What it does
This was Databricks’ official package for running scikit-learn’s GridSearchCV and similar tools across a Spark cluster. It wrapped scikit-learn’s parallel hyperparameter search so multiple model fits could execute on Spark executors instead of a single machine’s cores. It also converted Spark DataFrames to numpy arrays or sparse matrices.
The interesting bit
The project is now a historical artifact. The README opens with deprecation and immediately redirects users to joblib-spark — a backend that plugs into scikit-learn’s existing parallel_backend system rather than requiring a separate GridSearchCV wrapper. The old package needed you to pass a SparkContext explicitly (GridSearchCV(sc, ...)); the new approach uses a context manager.
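A minimal sketch of the context-manager pattern, assuming joblib and scikit-learn are installed. `register_spark` and the "spark" backend name come from joblib-spark. The helper names `pick_backend` and `tune`, the iris data, the parameter grid, and the local "threading" fallback are illustrative assumptions, not part of either package.

```python
from joblib import parallel_backend
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

# Legacy form, per the old package: GridSearchCV(sc, ...) with an explicit
# SparkContext. The replacement leaves the estimator code unchanged and
# selects the backend around the fit call instead.


def pick_backend():
    """Return 'spark' if joblib-spark is importable, else a local backend."""
    try:
        from joblibspark import register_spark
    except ImportError:
        # Assumption for this sketch: fall back to local threads.
        return "threading"
    register_spark()  # registers the 'spark' backend with joblib
    return "spark"


def tune(backend, n_jobs=3):
    """Run a small grid search under the given joblib backend."""
    iris = datasets.load_iris()
    param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}
    search = GridSearchCV(svm.SVC(gamma="auto"), param_grid, cv=3)
    # Plain scikit-learn GridSearchCV; only the backend changes.
    with parallel_backend(backend, n_jobs=n_jobs):
        search.fit(iris.data, iris.target)
    return search
```

Calling `tune(pick_backend())` would dispatch the fits to Spark when joblib-spark and a working Spark installation are present, and run them locally otherwise. The point of the design is that the search code itself no longer knows about Spark.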
Key highlights
- Deprecated by its own maintainers; no new development
- Original scope: small data, embarrassingly parallel search (not distributed algorithms — Spark MLlib handles big data)
- Replacement: joblib-spark via pip install joblibspark
- Requirements for replacement: pyspark>=2.4.4, scikit-learn>=0.21
- Original package supported scikit-learn 0.18–0.19 and Spark >= 2.1.1
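Before migrating, it may help to confirm that an environment meets the version floors listed above. The following is a rough sketch using only the standard library. The helper names are hypothetical, and the comparison deliberately ignores pre-release and post-release suffixes, so treat it as a quick sanity check rather than a full version resolver.

```python
import re
from importlib import metadata

# Minimum versions quoted above for the joblib-spark replacement.
MINIMUMS = {"pyspark": "2.4.4", "scikit-learn": "0.21"}


def version_tuple(version):
    """Turn '2.4.4' or '0.21.3.post1' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def _pad(parts, width):
    """Right-pad a version tuple with zeros so lengths match."""
    return parts + (0,) * (width - len(parts))


def meets_minimum(installed, required):
    """True if the installed version is at least the required one."""
    have, need = version_tuple(installed), version_tuple(required)
    width = max(len(have), len(need))
    return _pad(have, width) >= _pad(need, width)


def check_environment(minimums=MINIMUMS):
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for name, required in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
        else:
            report[name] = meets_minimum(installed, required)
    return report
```

For example, `meets_minimum("0.20.3", "0.21")` is false, which matches the range the original package targeted rather than the replacement's floor.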
Caveats
- Tests were already incompatible with scikit-learn 0.20 before deprecation
- The “seamless” DataFrame conversion was advertised, but it was described as a basic utility, not a primary feature
Verdict
Worth reading only if you’re maintaining legacy code that imports spark_sklearn. For new projects, follow the README’s own advice and use joblib-spark directly. If you need distributed learning on large datasets, Spark MLlib was always the intended path.