Training neural nets on CPUs faster than GPUs, with hash tables
SLIDE uses locality-sensitive hashing to skip unnecessary neurons, making CPU training competitive with GPU baselines on massive output layers.

What it does
SLIDE is a C++ training framework for neural networks with enormous output spaces, such as recommendation systems with millions of items. Instead of computing every output logit, it uses locality-sensitive hashing to activate only a small subset of neurons per sample. The repo reproduces the original SLIDE paper (MLSys 2020); a more optimized version lives at RUSH-LAB/SLIDE.
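To make the idea concrete, here is a minimal Python sketch of LSH-based neuron selection. It is an illustration under stated assumptions, not SLIDE's C++ code: it assumes SimHash (signed random projections) as the hash family, and the names `NeuronIndex`, `active_neurons`, and `sparse_logits`, as well as the table and bit counts, are hypothetical.

```python
import random
from collections import defaultdict


def make_planes(num_bits, dim, rng):
    """Draw num_bits random hyperplanes for one SimHash table."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash(vec, planes):
    """Bucket key: one sign bit per hyperplane."""
    return tuple(
        sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in planes
    )


class NeuronIndex:
    """Hypothetical helper: several hash tables over one layer's weight vectors."""

    def __init__(self, weights, num_tables=8, num_bits=6, seed=0):
        rng = random.Random(seed)
        dim = len(weights[0])
        self.weights = weights
        self.planes = [make_planes(num_bits, dim, rng) for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        # Insert every neuron into one bucket per table, keyed by its weight hash.
        for neuron_id, w in enumerate(weights):
            for planes, table in zip(self.planes, self.tables):
                table[simhash(w, planes)].append(neuron_id)

    def active_neurons(self, x):
        """Union of the buckets that the input x lands in, across all tables."""
        active = set()
        for planes, table in zip(self.planes, self.tables):
            active.update(table.get(simhash(x, planes), ()))
        return active

    def sparse_logits(self, x):
        """Compute dot products only for the retrieved neurons."""
        return {
            i: sum(w * v for w, v in zip(self.weights[i], x))
            for i in self.active_neurons(x)
        }
```

A query collides, in each table, only with neurons whose weights fall on the same side of every hyperplane, so logits are computed for a candidate set rather than for every output. A training system would also need to re-hash neurons as their weights change, which this sketch omits.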
The interesting bit
The core bet: smart sparse algorithms can beat brute-force GPU parallelism when the bottleneck is memory bandwidth, not compute. The README is admirably blunt about needing 900+ transparent huge pages and Skylake-or-newer AVX-512 — this is not plug-and-play software.
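Before building, it can help to check what the machine actually offers. The following is a rough Python sketch for Linux that reports the hugepage pool, the transparent hugepage mode, and whether the CPU advertises AVX-512. The helper names are hypothetical, and the script only reports values: whether SLIDE needs a preallocated pool or transparent huge pages in a given mode is for the README to say, not this sketch.

```python
from pathlib import Path


def read_text_or_empty(path):
    """Read a file, returning '' if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int} (page counts or kB)."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def thp_mode(text):
    """Extract the bracketed active mode from e.g. 'always [madvise] never'."""
    start, end = text.find("["), text.find("]")
    return text[start + 1:end] if 0 <= start < end else None


def has_avx512f(cpuinfo_text):
    """True if the first 'flags' line in /proc/cpuinfo-style text lists avx512f."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return "avx512f" in value.split()
    return False


def main():
    meminfo = parse_meminfo(read_text_or_empty("/proc/meminfo"))
    thp = thp_mode(read_text_or_empty("/sys/kernel/mm/transparent_hugepage/enabled"))
    print("HugePages_Total:", meminfo.get("HugePages_Total"))
    print("Hugepagesize (kB):", meminfo.get("Hugepagesize"))
    print("Transparent hugepage mode:", thp)
    print("AVX-512F available:", has_avx512f(read_text_or_empty("/proc/cpuinfo")))


if __name__ == "__main__":
    main()
```

On a non-Linux machine, or in a container that hides these files, every value simply comes back as `None` or `False`.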
Key highlights
- Sparse activation via LSH replaces dense matrix multiplications on the final layer
- Ships with TensorFlow baselines (full and sampled softmax) for the Amazon-670K dataset
- Self-contained build: pulls ZLIB and CNPY automatically
- Docker image available for the brave; manual hugepage configuration required otherwise
- Newer CPU-optimized version (BFloat16, better memory layout) maintained at RUSH-LAB/SLIDE
Caveats
- README warns: if your system lacks hugepage support, you must revert to an older commit, at a cost of roughly 30% in performance
- Dataset must be manually shuffled; labels arrive sorted, which will quietly wreck your validation metrics
- Code targets a single benchmark dataset; generalization to other architectures is unclear from the docs
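The shuffling caveat is easy to handle with a few lines of Python. This is a minimal sketch, assuming the training file is plain text with one sample per line and a single header line at the top that must stay first; the function name, the `has_header` flag, and the file names in the usage note are hypothetical. It loads the whole file into memory, which is a simplification.

```python
import random


def shuffle_dataset(src_path, dst_path, seed=0, has_header=True):
    """Write a row-shuffled copy of src_path to dst_path.

    Assumes one sample per line; if has_header is True, the first line is
    kept in place. Returns the number of shuffled rows.
    """
    with open(src_path, encoding="utf-8") as src:
        header = src.readline() if has_header else ""
        rows = src.readlines()
    # Make sure the last row ends with a newline so rows do not merge.
    if rows and not rows[-1].endswith("\n"):
        rows[-1] += "\n"
    # A seeded generator keeps the shuffle reproducible across runs.
    random.Random(seed).shuffle(rows)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(header)
        dst.writelines(rows)
    return len(rows)
```

For example, `shuffle_dataset("train.txt", "train_shuffled.txt", seed=42)` would produce a shuffled copy and leave the original untouched.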
Verdict
Worth a look if you’re researching extreme classification or sparse training methods, or if you’re stuck on CPU-only hardware and need to train million-output models. Skip it if you want a maintained, general-purpose framework: this is research reproduction code with sharp edges.