lensacom/sparkit-learn

Scikit-learn's API, but your laptop doesn't melt

A compatibility layer that runs familiar sklearn pipelines on PySpark clusters without rewriting your code.

★ 1.2k stars · Python · ML Frameworks

What it does

Sparkit-learn wraps PySpark in a scikit-learn-style API so that vectorizers, transformers, and classifiers can run on distributed data. You make the same fit_transform calls you’re used to, passing the library’s RDD wrappers instead of in-memory arrays; it distributes the work across partitions behind the scenes.
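To make the idea concrete without a cluster, here is a minimal pure-Python sketch of a fit_transform that works block by block. It does not use sparkit-learn or PySpark: the class name BlockedCountVectorizer, the whitespace tokenizer, and the list-of-blocks stand-in for partitions are all assumptions made for illustration, not the library's implementation.

```python
from collections import Counter


class BlockedCountVectorizer:
    """Toy count vectorizer over a list of blocks (hypothetical, not splearn's API)."""

    def fit(self, blocks):
        # Pass 1: collect the terms seen in every block; their union is the
        # global vocabulary (on a cluster this would be a reduce step).
        vocab = set()
        for block in blocks:
            for doc in block:
                vocab.update(doc.lower().split())
        # Sort so that column indices are deterministic.
        self.vocabulary_ = {term: i for i, term in enumerate(sorted(vocab))}
        return self

    def transform(self, blocks):
        # Pass 2: every block is transformed independently (a map step).
        return [[self._count(doc) for doc in block] for block in blocks]

    def fit_transform(self, blocks):
        return self.fit(blocks).transform(blocks)

    def _count(self, doc):
        counts = Counter(doc.lower().split())
        row = [0] * len(self.vocabulary_)
        for term, n in counts.items():
            idx = self.vocabulary_.get(term)
            if idx is not None:  # ignore terms unseen during fit
                row[idx] = n
        return row


# Two "partitions" of documents, standing in for distributed data.
blocks = [["spark is fast", "spark scales"], ["sklearn is familiar"]]
vectorizer = BlockedCountVectorizer()
counts = vectorizer.fit_transform(blocks)
```

The calling code looks like ordinary scikit-learn usage; only the input (a collection of blocks rather than one array) differs, which is the trade the library appears to make.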

The interesting bit

The library’s motto is “Think locally, execute distributively.” It achieves this through three custom RDD types (ArrayRDD, SparseRDD, and DictRDD) that hold data as blocks of numpy arrays or scipy sparse matrices. Each operation is written against a single in-memory block, so the mental model stays local while the execution spreads across your cluster.
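The block-level idea can be sketched in plain Python. The helpers below (to_blocks, map_blocks, and the block_size parameter) are hypothetical names for this illustration and are not part of sparkit-learn; plain lists stand in for the numpy blocks an RDD would hold.

```python
def to_blocks(items, block_size):
    """Split a sequence into consecutive blocks of at most block_size items."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [list(items[i:i + block_size]) for i in range(0, len(items), block_size)]


def map_blocks(blocks, fn):
    """Apply a local function to every block independently."""
    return [fn(block) for block in blocks]


data = list(range(20))
blocks = to_blocks(data, block_size=5)   # four blocks of five items each
block_sums = map_blocks(blocks, sum)     # local work, one result per block
total = sum(block_sums)                  # combine step across blocks
```

The function passed to map_blocks only ever sees one block, which is what lets the same local code run unchanged whether there are four blocks on a laptop or thousands on a cluster.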

Key highlights

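Spark-prefixed counterparts of CountVectorizer, HashingVectorizer, TfidfTransformer, LinearSVC, and Pipeline (SparkCountVectorizer, SparkHashingVectorizer, and so on) that take the library's RDD types as input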
DictRDD supports columnar data with mixed types, enabling familiar X, y splits on distributed datasets
Block-level operations preserve numpy/scipy semantics, including slicing, sum(axis=...), and .todense()
Requires Spark ≥1.3.0 and Python 2.7 or 3.4
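As a rough picture of the columnar X, y split and the sum(axis=...) behaviour listed above, here is a small stdlib-only sketch. ColumnBlocks and column_sums are hypothetical names invented for this example; the real DictRDD operates on Spark RDDs of numpy or scipy blocks, which this toy replaces with nested lists.

```python
class ColumnBlocks:
    """Toy columnar container: each block is a tuple with one entry per column."""

    def __init__(self, blocks, columns):
        self.blocks = blocks
        self.columns = tuple(columns)

    def column(self, name):
        # Pull one named column out of every block, keeping the block structure.
        idx = self.columns.index(name)
        return [block[idx] for block in self.blocks]


def column_sums(x_blocks):
    """Sum over rows (axis 0): per-block partial sums, then a combine step."""
    partials = [[sum(col) for col in zip(*block)] for block in x_blocks]
    return [sum(col) for col in zip(*partials)]


data = ColumnBlocks(
    blocks=[
        ([[1, 2], [3, 4]], ["a", "b"]),  # block 1: two rows of features, two labels
        ([[5, 6]], ["a"]),               # block 2: one row of features, one label
    ],
    columns=("X", "y"),
)
X = data.column("X")     # numeric feature blocks
y = data.column("y")     # string label blocks, aligned with X block by block
sums = column_sums(X)    # column totals across all blocks
```

Because each block keeps its features and labels side by side, selecting X and y never requires moving rows between blocks.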

Caveats

README shows only text feature extraction and LinearSVC; broader model coverage is unclear
Development appears to have targeted Spark 1.x-era APIs, so expect to test compatibility yourself before using it on a modern cluster

Verdict

Worth a look if you’re sitting on a PySpark cluster and want to port sklearn text pipelines without a rewrite. Skip it if you’re already committed to Spark MLlib or need guarantees on current Spark versions.

Frequently asked

What is lensacom/sparkit-learn? A compatibility layer that runs familiar sklearn pipelines on PySpark clusters without rewriting your code.
Is sparkit-learn open source? Yes, lensacom/sparkit-learn is open source, released under the Apache-2.0 license.
What language is sparkit-learn written in? lensacom/sparkit-learn is primarily written in Python.
How popular is sparkit-learn? lensacom/sparkit-learn has 1.2k stars on GitHub.
Where can I find sparkit-learn? lensacom/sparkit-learn is on GitHub at https://github.com/lensacom/sparkit-learn.