cerndb/dist-keras

Keras on Spark: a physics lab's take on distributed training

CERN's dist-keras wraps Keras models in Apache Spark to run data-parallel deep learning across clusters, with a research-friendly focus on pluggable distributed optimizers.

What it does

dist-keras trains Keras models on Apache Spark clusters using data parallelism: multiple model replicas each work on a shard of the data and periodically synchronize parameters. It bundles several distributed optimizers (DOWNPOUR, EASGD variants, model averaging, ensemble training) and wraps them in Spark-friendly Python trainer classes that you instantiate around a Keras model.
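
To make the replica-and-synchronize idea concrete, here is a minimal sketch of model averaging, the simplest of these strategies, in plain Python. It does not use dist-keras, Keras, or Spark: the one-variable linear model, the function names (`local_sgd`, `average`, `train_model_averaging`), and all hyperparameters are assumptions chosen for illustration, and the "workers" run sequentially in one process.

```python
import random

def local_sgd(weights, shard, lr=0.05):
    """One pass of plain SGD over a shard for the model y = w*x + b."""
    w, b = weights
    for x, y in shard:
        err = (w * x + b) - y      # prediction error on this sample
        w -= lr * err * x          # gradient step for the slope
        b -= lr * err              # gradient step for the intercept
    return w, b

def average(replicas):
    """Synchronization step: element-wise mean of the replica parameters."""
    n = len(replicas)
    return (sum(r[0] for r in replicas) / n,
            sum(r[1] for r in replicas) / n)

def train_model_averaging(data, num_workers=4, rounds=50):
    # Each (simulated) worker gets its own shard of the data.
    shards = [data[i::num_workers] for i in range(num_workers)]
    weights = (0.0, 0.0)
    for _ in range(rounds):
        # Every replica starts from the shared weights and trains locally...
        replicas = [local_sgd(weights, shard) for shard in shards]
        # ...then the replicas are averaged into the next shared weights.
        weights = average(replicas)
    return weights

random.seed(0)
# Hypothetical noise-free data drawn from y = 3x + 1.
data = [(x, 3.0 * x + 1.0) for x in (random.uniform(-1, 1) for _ in range(200))]
w, b = train_model_averaging(data)
print(w, b)
```

On a real cluster the list comprehension over shards would be the part that runs in parallel; the averaging step is the synchronization point.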

The interesting bit

The project treats distributed optimizers as swappable research primitives. The author implemented custom methods like ADAG (a less hyperparameter-sensitive DOWNPOUR variant) and DynSGD (which adapts learning rates per-worker based on parameter staleness) directly from recent academic work. There’s even a lightweight “Punchcard” job server for remote cluster submission via HTTP—handy if your dev machine isn’t your compute cluster.
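
As a rough illustration of staleness-aware updates in the spirit of DynSGD, the sketch below has a toy parameter server count how many updates were committed between a worker's pull and its commit, and shrinks the contribution accordingly. The `1 / (staleness + 1)` scaling, the class name `StalenessAwareServer`, and its methods are assumptions for this example, not the dist-keras API.

```python
class StalenessAwareServer:
    """Toy in-process parameter server that down-weights stale updates."""

    def __init__(self, params):
        self.params = list(params)
        self.clock = 0  # number of updates committed so far

    def pull(self):
        # A worker receives the current parameters and the clock value
        # it must report back when it commits.
        return list(self.params), self.clock

    def commit(self, delta, pulled_at):
        # Staleness = updates committed by others since this worker pulled.
        staleness = self.clock - pulled_at
        scale = 1.0 / (staleness + 1)  # assumed scaling rule
        self.params = [p + scale * d for p, d in zip(self.params, delta)]
        self.clock += 1
        return staleness

server = StalenessAwareServer([0.0])
_, pulled_a = server.pull()           # worker A pulls at clock 0
_, pulled_b = server.pull()           # worker B pulls at clock 0
stale_a = server.commit([1.0], pulled_a)  # nothing committed since: full delta
stale_b = server.commit([1.0], pulled_b)  # one commit since: half the delta
print(server.params, stale_a, stale_b)
```

The effect is that a slow worker's update, computed against old parameters, moves the shared model less than a fresh one does.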

Key highlights

  • Ships with 7+ distributed training strategies, from basic model averaging to asynchronous elastic averaging SGD
  • ADAG is flagged as “currently recommended” by the authors based on their own experiments
  • Includes remote job deployment through a secret-token-based Punchcard server
  • Ensemble training trains n full models in parallel, then averages predictions
  • CERN IT-DB origin; comes with a BibTeX citation block for academic use
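
The ensemble bullet above boils down to averaging the outputs of independently trained models. A minimal sketch, with plain callables standing in for trained models (the name `ensemble_predict` and the toy models are hypothetical):

```python
def ensemble_predict(models, x):
    """Average the class-probability vectors returned by each model."""
    preds = [model(x) for model in models]
    n = len(preds)
    # zip(*preds) groups the probabilities for each class across models.
    return [sum(class_probs) / n for class_probs in zip(*preds)]

# Three stand-in "models" that ignore the input and return fixed probabilities.
models = [
    lambda x: [0.75, 0.25],
    lambda x: [0.50, 0.50],
    lambda x: [0.25, 0.75],
]
avg = ensemble_predict(models, x=None)
print(avg)
```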

Caveats

  • Python 3 compatibility is listed as a known issue
  • README warns that adding more asynchronous workers can hurt statistical performance (“implicit momentum” claims are noted but flagged as needing more research)
  • Several TODOs remain open: HDFS model save/load, network compression, multi-parameter-server support

Verdict

Worth a look if you’re already on Spark and want to experiment with distributed training algorithms without writing parameter servers from scratch. Skip it if you need Python 3, modern Keras/TensorFlow 2.x, or production-grade fault tolerance—the project appears research-oriented and somewhat dormant.
