Yahoo's deep learning bridge to Hadoop is archived, not forgotten
A 2016 attempt to run Caffe on Spark clusters without building a separate GPU farm.

What it does
CaffeOnSpark wraps the Caffe deep learning framework into a Spark package, letting you train neural networks on Hadoop clusters using HDFS-stored data. It supports training, testing, and feature extraction across GPU and CPU servers, with a Scala API for Spark applications.
The interesting bit
The server-to-server direct communication over Ethernet or InfiniBand was the real architectural bet: it aimed to dodge the “separate deep learning cluster” tax that most organizations faced. Yahoo ran this in production for image search and content classification on their private cloud.
Key highlights
- Reuses existing Caffe LMDB datasets and prototxt configs with minor tweaks
- Spark 1.x and 2.x support (default: Spark 2.0.0, Hadoop 2.7.1, Scala 2.11.7)
- Incremental learning from prior models or snapshots
- Deployable on AWS EC2 or private cloud
- Per-device batch sizes in prototxt files
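Because batch sizes in the prototxt are per device, the number of samples consumed per training step grows with the cluster. The sketch below shows that arithmetic, assuming synchronous data-parallel training in which every device processes its own batch each step. The function name, its parameters, and the cluster shape in the example are hypothetical and not part of CaffeOnSpark.

```python
def effective_batch_size(per_device_batch: int,
                         devices_per_executor: int,
                         executors: int) -> int:
    """Samples consumed per training step across the whole cluster.

    Assumes every device on every executor processes one batch of
    `per_device_batch` samples per step (an assumption for illustration).
    """
    if min(per_device_batch, devices_per_executor, executors) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_device_batch * devices_per_executor * executors


# Hypothetical setup: batch_size: 64 in the prototxt,
# 2 GPUs per executor, 4 Spark executors.
print(effective_batch_size(64, 2, 4))  # 512
```

In practice this means a prototxt tuned on one GPU may need its learning rate or batch size revisited before it is spread across many devices.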
Caveats
- Archived and unsupported since 2016 — Yahoo explicitly notes they’re no longer maintaining it
- Memory layers require "share_in_parallel: false" to avoid GPU sharing issues
- Build versions are pinned in caffe-grid/pom.xml and likely stale
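If you are adapting existing configs, a quick check for the memory-layer caveat can be sketched as follows. It assumes the flag is set inside a memory_data_param { ... } block with no nested braces; the helper name and the sample text are hypothetical, and the regex is a rough text scan rather than a real prototxt parser (a comment containing braces would confuse it).

```python
import re

# Assumption: the flag sits inside a flat memory_data_param { ... } block.
_BLOCK = re.compile(r"memory_data_param\s*\{([^{}]*)\}")
_FLAG = re.compile(r"share_in_parallel\s*:\s*false\b")


def blocks_missing_flag(prototxt_text: str) -> int:
    """Count memory_data_param blocks that do not set share_in_parallel: false."""
    bodies = _BLOCK.findall(prototxt_text)
    return sum(1 for body in bodies if not _FLAG.search(body))


# Hypothetical two-layer config: the first block sets the flag, the second does not.
sample = """
layer {
  name: "data"
  type: "MemoryData"
  memory_data_param {
    batch_size: 64
    share_in_parallel: false
  }
}
layer {
  name: "data_test"
  type: "MemoryData"
  memory_data_param {
    batch_size: 100
  }
}
"""

print(blocks_missing_flag(sample))  # 1
```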
Verdict
Worth reading if you’re studying how big tech bridged pre-TensorFlow deep learning onto existing data infrastructure. Skip it if you need something that runs today: this is a fossil, not a foundation.