Is spark-sklearn open source?

Yes — databricks/spark-sklearn is open source, released under the Apache-2.0 license.

What language is spark-sklearn written in?

databricks/spark-sklearn is primarily written in Python.

How popular is spark-sklearn?

databricks/spark-sklearn has 1.1k stars on GitHub.

Where can I find spark-sklearn?

databricks/spark-sklearn is on GitHub at https://github.com/databricks/spark-sklearn.


databricks/spark-sklearn

Databricks killed its own Spark-sklearn bridge. Here's the replacement.

The official integration is deprecated; the README now points to joblib-spark as the successor for distributed scikit-learn tuning.

★ 1.1k stars · Python · ML Frameworks


What it does

This was Databricks’ official package for running scikit-learn’s GridSearchCV and similar tools across a Spark cluster. It wrapped scikit-learn’s parallel hyperparameter search so multiple model fits could execute on Spark executors instead of a single machine’s cores. It also converted Spark DataFrames to numpy arrays or sparse matrices.

The interesting bit

The project is now a historical artifact. The README opens with a deprecation notice and immediately redirects users to joblib-spark, a backend that plugs into the joblib parallel_backend mechanism scikit-learn already uses, rather than requiring a separate GridSearchCV wrapper. The old package needed you to pass a SparkContext explicitly (GridSearchCV(sc, ...)); the new approach selects the Spark backend with a context manager.
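The context-manager pattern looks roughly like the sketch below. It assumes joblibspark is installed and a Spark setup is reachable; register_spark and the "spark" backend name come from joblib-spark, while tune_svc, the iris dataset, and the parameter grid are hypothetical choices made for illustration only.

```python
from joblib import parallel_backend
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def tune_svc(backend="spark", n_jobs=3):
    """Grid-search an SVC on iris, running the fits through `backend`."""
    if backend == "spark":
        # Imported lazily so a local backend works without pyspark installed.
        from joblibspark import register_spark

        register_spark()  # makes "spark" a known joblib backend

    X, y = load_iris(return_X_y=True)
    # Illustrative grid (an assumption, not taken from the README).
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
    search = GridSearchCV(SVC(), param_grid, cv=3)

    # No SparkContext is handed to GridSearchCV; the backend is chosen here.
    with parallel_backend(backend, n_jobs=n_jobs):
        search.fit(X, y)
    return search
```

Calling tune_svc() needs joblibspark and a working Spark setup. tune_svc(backend="threading") runs the same search on local threads, which is handy for checking the code without a cluster.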

Key highlights

Deprecated by its own maintainers; no new development
Original scope: small data, embarrassingly parallel search (not distributed algorithms — Spark MLlib handles big data)
Replacement: joblib-spark via pip install joblibspark
Requirements for replacement: pyspark>=2.4.4, scikit-learn>=0.21
Original package supported scikit-learn 0.18–0.19 and Spark >= 2.1.1
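
The version floors in the list above can be checked before switching to the replacement. This is a minimal sketch that compares plain dotted release numbers only; release_tuple, meets_minimum, and check_replacement_requirements are hypothetical helper names, not part of joblib-spark or scikit-learn.

```python
import re
from importlib import metadata

# Minimum versions quoted in the list above for the joblib-spark route.
MINIMUMS = {"pyspark": "2.4.4", "scikit-learn": "0.21"}


def release_tuple(version):
    """Turn '2.4.4' or '1.3.0rc1' into a tuple of its leading integers."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """Compare release numbers, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_replacement_requirements():
    """Map each required package to True, False, or None if not installed."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report
```

A None entry in the report means the package is not installed in the current environment; False means it is installed but older than the stated floor.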

Caveats

Tests were already incompatible with scikit-learn 0.20 before deprecation
The DataFrame conversion was advertised as “seamless” but marked as a basic utility, not a primary feature

Verdict

Worth reading only if you’re maintaining legacy code that imports spark_sklearn. For new projects, follow the README’s own advice and use joblib-spark directly. If you need distributed learning on large datasets, Spark MLlib was always the intended path.

Frequently asked

What is databricks/spark-sklearn?: Databricks’ official, now deprecated, package for running scikit-learn hyperparameter searches across a Spark cluster. The README points to joblib-spark as the successor for distributed scikit-learn tuning.