yahoo/CaffeOnSpark

Yahoo's deep learning bridge to Hadoop is archived, not forgotten

A 2016 attempt to run Caffe on Spark clusters without building a separate GPU farm.

1.3k stars · Jupyter Notebook · ML Frameworks · Velocity (7d): +0.3 ★/day · Trend: steady

What it does

CaffeOnSpark wraps the Caffe deep learning framework into a Spark package, letting you train neural networks on Hadoop clusters using HDFS-stored data. It supports training, testing, and feature extraction across GPU and CPU servers, with a Scala API for Spark applications.

The interesting bit

The real architectural bet was direct server-to-server communication over Ethernet or InfiniBand. It aimed to dodge the "separate deep learning cluster" tax: the cost of building and feeding a dedicated GPU farm alongside the Hadoop cluster that already holds the data. Yahoo ran this in production for image search and content classification on their private cloud.

Key highlights

  • Reuses existing Caffe LMDB datasets and prototxt configs with minor tweaks
  • Spark 1.x and 2.x support (default: Spark 2.0.0, Hadoop 2.7.1, Scala 2.11.7)
  • Incremental learning from prior models or snapshots
  • Deployable on AWS EC2 or private cloud
  • Per-device batch sizes in prototxt files
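
Because batch sizes in the prototxt are per device, the number of samples consumed per iteration grows with the cluster. The arithmetic can be sketched as below, assuming synchronous data-parallel training in which every device processes its own batch each iteration. The function and parameter names are hypothetical and are not part of CaffeOnSpark.

```python
def effective_batch_size(per_device_batch, devices_per_executor, executors):
    """Samples consumed per synchronous iteration across the whole cluster.

    Assumption: every device on every executor processes its own batch of
    `per_device_batch` samples in each iteration.
    """
    if min(per_device_batch, devices_per_executor, executors) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_device_batch * devices_per_executor * executors


# Hypothetical cluster: batch_size 64 in the prototxt, 4 GPUs per executor, 2 executors.
print(effective_batch_size(64, 4, 2))  # 512
```

In practice this means a prototxt tuned on a single GPU may need its learning rate or batch size revisited before it is run on a larger cluster.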

Caveats

  • Archived and unsupported: Yahoo explicitly notes they're no longer maintaining it
  • Memory data layers must set "share_in_parallel: false" so that they are not shared among GPUs
  • Build versions are pinned in caffe-grid/pom.xml and likely stale
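
The share_in_parallel caveat is easy to miss when adapting an existing prototxt, so a quick check can help. Below is a minimal sketch that scans prototxt text for memory data layers lacking the flag. It assumes such layers are declared with type "MemoryData" and that the flag appears somewhere inside the layer block. The helper name is hypothetical, and the brace matching is deliberately naive: it does not handle braces inside comments or strings.

```python
import re


def memory_layers_missing_flag(prototxt):
    """Return names of MemoryData layers that lack `share_in_parallel: false`."""
    missing = []
    for match in re.finditer(r"\blayer\s*\{", prototxt):
        # Walk forward from the opening brace to its matching close.
        depth, end = 1, match.end()
        while depth and end < len(prototxt):
            depth += {"{": 1, "}": -1}.get(prototxt[end], 0)
            end += 1
        block = prototxt[match.end():end]
        if not re.search(r'type:\s*"MemoryData"', block):
            continue  # only memory data layers are subject to this caveat
        if not re.search(r"share_in_parallel:\s*false", block):
            name = re.search(r'name:\s*"([^"]*)"', block)
            missing.append(name.group(1) if name else "<unnamed>")
    return missing
```

Calling it on the contents of a network definition returns an empty list when every memory data layer carries the flag, and otherwise the names of the layers that still need it.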

Verdict

Worth reading if you're studying how big tech bridged pre-TensorFlow deep learning onto existing data infrastructure. Skip it if you need something that runs today: this is a fossil, not a foundation.
