Keras on Spark: a physics lab's take on distributed training
CERN's dist-keras wraps Keras models in Apache Spark to run data-parallel deep learning across clusters, with a research-friendly focus on pluggable distributed optimizers.

What it does
dist-keras lets you train Keras models on Apache Spark clusters using data-parallel methods: multiple model replicas work on shards of data, periodically synchronizing parameters. It bundles several distributed optimizers—DOWNPOUR, EASGD variants, model averaging, ensemble training—and wraps them in Spark-friendly Python classes you instantiate like regular Keras trainers.
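To make the replica-and-synchronize idea concrete, here is a minimal sketch of synchronous model averaging in plain Python. It is not dist-keras code and uses neither Spark nor Keras: the 1-D linear model, the helper names `sgd_on_shard` and `averaged_round`, and all hyperparameters are assumptions chosen for illustration.

```python
def sgd_on_shard(w, b, shard, lr=0.05, epochs=20):
    """Run plain SGD for the model y = w*x + b on one shard of (x, y) pairs."""
    for _ in range(epochs):
        for x, y in shard:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def averaged_round(w, b, shards):
    """One synchronization round: every replica starts from the same
    parameters, trains on its own shard, and the results are averaged."""
    results = [sgd_on_shard(w, b, shard) for shard in shards]
    w_avg = sum(r[0] for r in results) / len(results)
    b_avg = sum(r[1] for r in results) / len(results)
    return w_avg, b_avg


# Noise-free toy data drawn from y = 2x + 1, split into three shards.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
shards = [data[i::3] for i in range(3)]

w, b = 0.0, 0.0
for _ in range(10):  # ten synchronization rounds
    w, b = averaged_round(w, b, shards)
```

On a real cluster each call to `sgd_on_shard` would run on a different worker; here the replicas run one after another, which yields the same averaged parameters.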

The interesting bit
The project treats distributed optimizers as swappable research primitives. The author implemented custom methods like ADAG (a less hyperparameter-sensitive DOWNPOUR variant) and DynSGD (which adapts learning rates per-worker based on parameter staleness) directly from recent academic work. There’s even a lightweight “Punchcard” job server for remote cluster submission via HTTP—handy if your dev machine isn’t your compute cluster.
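The staleness idea behind DynSGD can be sketched with a toy parameter server. This is an illustration, not the project's implementation: the class `ToyParameterServer`, its methods, and the `1 / (staleness + 1)` scaling rule are assumptions, with staleness taken to mean the number of updates the server applied since the worker last pulled.

```python
class ToyParameterServer:
    """Hypothetical server holding one scalar parameter; damps stale updates."""

    def __init__(self, value=0.0):
        self.value = value
        self.version = 0  # number of updates applied so far

    def pull(self):
        """Return the current parameter and the version it corresponds to."""
        return self.value, self.version

    def push(self, delta, pulled_version):
        """Apply a worker's delta, scaled down by how stale its copy was."""
        staleness = self.version - pulled_version
        scale = 1.0 / (staleness + 1)  # assumed scaling rule
        self.value += scale * delta
        self.version += 1
        return scale


server = ToyParameterServer()
_, v_fast = server.pull()
_, v_slow = server.pull()  # both workers pulled at version 0

fast_scale = server.push(-1.0, v_fast)  # nothing applied since the pull
slow_scale = server.push(-1.0, v_slow)  # one update applied since the pull
```

The second worker's update is halved because the server had already moved one step since that worker read the parameters. A worker that lags further behind would be damped more.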

Key highlights
- Ships with 7+ distributed training strategies, from basic model averaging to asynchronous elastic averaging SGD
- ADAG is flagged as “currently recommended” by the authors based on their own experiments
- Includes remote job deployment through a secret-token-based Punchcard server
- Ensemble training trains n full models in parallel, then averages predictions
- CERN IT-DB origin; comes with a BibTeX citation block for academic use
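The prediction-averaging step of ensemble training, mentioned in the highlights above, can be sketched as follows. The function name `average_predictions` and the sample probabilities are made up for illustration, and the sketch assumes each model outputs a class-probability vector for the same input.

```python
def average_predictions(per_model_probs):
    """Average the class-probability vectors that several models
    produced for one input."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [
        sum(probs[c] for probs in per_model_probs) / n_models
        for c in range(n_classes)
    ]


# Hypothetical outputs of three independently trained models for one sample.
probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.3, 0.1],
]

avg = average_predictions(probs)
predicted_class = max(range(len(avg)), key=avg.__getitem__)
```

Because the models never exchange parameters during training, this strategy needs no synchronization at all; the cost is that every model has to be trained and kept in full.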

Caveats
- Python 3 compatibility is listed as a known issue
- The README warns that adding more asynchronous workers can hurt statistical performance; it notes the claim that asynchrony acts as “implicit momentum” but flags it as needing more research
- Several TODOs remain open: HDFS model save/load, network compression, multi-parameter-server support

Verdict
Worth a look if you’re already on Spark and want to experiment with distributed training algorithms without writing parameter servers from scratch. Skip it if you need Python 3, modern Keras/TensorFlow 2.x, or production-grade fault tolerance—the project appears research-oriented and somewhat dormant.