lensacom/sparkit-learn

Scikit-learn's API, but your laptop doesn't melt

A compatibility layer that runs familiar sklearn pipelines on PySpark clusters without rewriting your code.

1.1k stars · Python · ML Frameworks · Velocity (7d): +0.3 ★/day · Trend: steady

What it does

Sparkit-learn wraps PySpark to mimic scikit-learn’s API, letting you run vectorizers, transformers, and classifiers on distributed data. You write the same fit_transform calls you’re used to; the library takes care of distributing the work across partitions behind the scenes.
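
To make the "same calls, distributed data" idea concrete, here is a minimal pure-Python sketch of a count vectorizer that fits and transforms over blocks of documents. It is an illustration only, not sparkit-learn's implementation: `fit_vocabulary` and `transform` are hypothetical helper names, and plain lists stand in for Spark partitions.

```python
from collections import Counter


def fit_vocabulary(blocks):
    """Build one shared vocabulary from per-block token sets (map, then reduce)."""
    # Map step: each block computes its own token set independently.
    local_vocabs = [{tok for doc in block for tok in doc.split()} for block in blocks]
    # Reduce step: merge the local sets and assign stable column indices.
    merged = set().union(*local_vocabs)
    return {tok: idx for idx, tok in enumerate(sorted(merged))}


def transform(blocks, vocabulary):
    """Turn each block of documents into a block of count rows, block by block."""
    out = []
    for block in blocks:
        rows = []
        for doc in block:
            counts = Counter(doc.split())
            rows.append([counts.get(tok, 0) for tok in vocabulary])
        out.append(rows)
    return out


# Two "partitions" of documents (hypothetical sample data).
blocks = [["spark runs jobs", "jobs run"], ["spark spark"]]
vocab = fit_vocabulary(blocks)
counts = transform(blocks, vocab)
```

Only the vocabulary merge needs information from every block; the transform step touches each block in isolation, which is what makes this kind of pipeline easy to spread over a cluster.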

The interesting bit

The library’s motto is “Think locally, execute distributively.” It achieves this through three custom RDD types (ArrayRDD, SparseRDD, and DictRDD) that hold data as blocks of NumPy arrays or SciPy sparse matrices. Each block behaves like an ordinary local array, so the mental model stays local while the execution spreads across your cluster.
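
The blocking idea can be sketched without Spark at all. The snippet below is an assumption-laden toy, not the library's code: `to_blocks` and `map_blocks` are hypothetical names, nested lists stand in for NumPy arrays, and a list of blocks stands in for an RDD.

```python
def to_blocks(rows, block_size):
    """Split a list of rows into consecutive blocks of at most block_size rows."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def map_blocks(blocks, fn):
    """Apply a purely local function to every block independently."""
    return [fn(block) for block in blocks]


rows = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
blocks = to_blocks(rows, block_size=2)

# The function only ever sees one block, as if it were the whole dataset.
doubled = map_blocks(blocks, lambda block: [[2 * v for v in row] for row in block])

# Flatten back to rows to compare against a non-blocked computation.
flat = [row for block in doubled for row in block]
```

The function passed to `map_blocks` is written as if it ran on a small local array, which is the "think locally" half of the motto; the "execute distributively" half is whatever runs those per-block calls in parallel.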

Key highlights

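  • Spark-prefixed counterparts (SparkCountVectorizer, SparkHashingVectorizer, SparkTfidfTransformer, SparkLinearSVC, SparkPipeline) that mirror the scikit-learn CountVectorizer, HashingVectorizer, TfidfTransformer, LinearSVC, and Pipeline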
  • DictRDD supports columnar data with mixed types, enabling familiar X, y splits on distributed datasets
  • Block-level operations preserve numpy/scipy semantics, including slicing, sum(axis=...), and .todense()
  • Requires Spark ≥1.3.0 and Python 2.7 or 3.4
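
As a rough picture of the columnar X, y split and of a blocked sum(axis=0), consider the following stdlib-only sketch. The dict-per-block layout and the helper names `column_sums` and `sum_axis0` are assumptions made for illustration; the real DictRDD and ArrayRDD operate on Spark partitions holding NumPy or SciPy data.

```python
def column_sums(block):
    """Column-wise sum of a single block (the local analogue of sum(axis=0))."""
    return [sum(col) for col in zip(*block)]


def sum_axis0(blocks):
    """Combine per-block partial sums into one global column-wise sum."""
    partials = [column_sums(block) for block in blocks]
    return [sum(col) for col in zip(*partials)]


# Hypothetical stand-in for a columnar blocked dataset:
# every block carries an "X" column and a "y" column of equal length.
blocks = [
    {"X": [[1, 0], [0, 2]], "y": [0, 1]},
    {"X": [[3, 1]], "y": [1]},
]

# Selecting a column yields a blocked view of just that column.
X_blocks = [block["X"] for block in blocks]
y_blocks = [block["y"] for block in blocks]

total = sum_axis0(X_blocks)
```

Each block produces a small partial result, and only those partials are combined, so the global answer matches what a single local array would give.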

Caveats

  • README shows only text feature extraction and LinearSVC; broader model coverage is unclear
  • Last significant activity appears to target Spark 1.x-era APIs, which may need compatibility testing on modern clusters

Verdict

Worth a look if you’re sitting on a PySpark cluster and want to port sklearn text pipelines without a rewrite. Skip it if you’re already committed to Spark MLlib or need guarantees on current Spark versions.
