The tensor library MXNet absorbed and forgot to mention
A 2010s-era C++ template library that let you write lazy GPU kernels without knowing CUDA, now frozen in amber inside Apache MXNet.

What it does
MShadow is a header-only C++ tensor library that compiles expression templates into CPU or CUDA kernels at build time. You write A = B + C * 2 and it generates a fused kernel—no temporary allocations, no explicit CUDA. It also shipped a parameter-server interface for multi-GPU and distributed training.
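To make the "fused kernel, no temporaries" claim concrete, here is a minimal CPU-only sketch of the expression-template technique. Every name in it (Exp, Vec, Add, Mul, Scalar, fused_demo) is hypothetical and chosen for illustration; it is not MShadow's API and it skips the CUDA side entirely. The point is only that B + C * 2 builds a small type describing the computation, and the single loop runs when the result is assigned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch only: hypothetical names, not MShadow's real types.

// CRTP base so operators can accept "any expression" without virtual calls.
template <typename E>
struct Exp {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Leaf: a constant that evaluates to the same value at every index.
struct Scalar : Exp<Scalar> {
    float value;
    explicit Scalar(float s) : value(s) {}
    float eval(std::size_t) const { return value; }
};

// Node for l + r. Operands are stored by value; they are small views.
template <typename L, typename R>
struct Add : Exp<Add<L, R>> {
    L l;
    R r;
    Add(const L& lhs, const R& rhs) : l(lhs), r(rhs) {}
    float eval(std::size_t i) const { return l.eval(i) + r.eval(i); }
};

// Node for l * r.
template <typename L, typename R>
struct Mul : Exp<Mul<L, R>> {
    L l;
    R r;
    Mul(const L& lhs, const R& rhs) : l(lhs), r(rhs) {}
    float eval(std::size_t i) const { return l.eval(i) * r.eval(i); }
};

// Leaf: a non-owning view over caller-provided storage.
struct Vec : Exp<Vec> {
    float* data;
    std::size_t n;
    Vec(float* d, std::size_t len) : data(d), n(len) {}
    float eval(std::size_t i) const { return data[i]; }

    // Assignment from an expression is where the work happens:
    // one loop over the output, no intermediate buffers.
    template <typename E>
    Vec& operator=(const Exp<E>& e) {
        const E& src = e.self();
        for (std::size_t i = 0; i < n; ++i) data[i] = src.eval(i);
        return *this;
    }
};

// Operators only build the expression type; nothing is computed here.
template <typename L, typename R>
Add<L, R> operator+(const Exp<L>& l, const Exp<R>& r) {
    return Add<L, R>(l.self(), r.self());
}

template <typename L>
Mul<L, Scalar> operator*(const Exp<L>& l, float s) {
    return Mul<L, Scalar>(l.self(), Scalar(s));
}

// Small driver: evaluates a = b + c * 2 and returns the middle element.
float fused_demo() {
    std::vector<float> sa(3, 0.0f);
    std::vector<float> sb{1.0f, 2.0f, 3.0f};
    std::vector<float> sc{10.0f, 20.0f, 30.0f};
    Vec a(sa.data(), 3), b(sb.data(), 3), c(sc.data(), 3);
    a = b + c * 2.0f;  // type is Add<Vec, Mul<Vec, Scalar>>; one loop runs
    assert(sa[0] == 21.0f);
    return sa[1];
}
```

In a design like this, a GPU backend would plausibly reuse the same expression type and call eval(i) from inside a kernel instead of a for loop, which is the sense in which one source can target both devices.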
The interesting bit
The “whitebox” design: you hand it a raw float* wrapped in a Tensor struct, and the machinery takes over. No hidden memory pools, no opaque handles. In an era of PyTorch’s eager execution and TensorFlow’s graph bloat, this was almost aggressively transparent.
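A rough sketch of what "a raw float* wrapped in a struct" can look like follows. The View2D type, its field names, and view_demo are assumptions made for illustration, not MShadow's actual Tensor definition. What it shows is the ownership model the paragraph describes: the caller allocates and frees the memory, and the wrapper only adds shape and indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning 2-D view; not MShadow's Tensor.
struct View2D {
    float* ptr;           // caller-owned storage; the view never frees it
    std::size_t rows;
    std::size_t cols;
    std::size_t stride;   // elements between consecutive row starts (>= cols)

    float& at(std::size_t r, std::size_t c) const {
        return ptr[r * stride + c];
    }
};

// Small driver: writes through the view and reads back from the raw buffer.
float view_demo() {
    std::vector<float> buf(6, 0.0f);        // the caller owns this allocation
    View2D m{buf.data(), 2, 3, 3};          // wrap it; nothing is copied
    assert(&m.at(0, 0) == buf.data());      // same memory, no hidden pool
    m.at(1, 2) = 7.0f;                      // lands in buf[1 * 3 + 2]
    return buf[5];
}
```

Because the view is a plain aggregate, the same bytes can be handed to any other library that accepts a pointer and dimensions, which is much of the appeal of this style.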
Key highlights
- Lazy expression templates compile to per-expression kernels; zero temporaries
- Single source runs on CPU and GPU without #ifdef soup
- Extensible: custom ops plug in without CUDA knowledge
- mshadow-ps interface unified multi-GPU and distributed training
- Donated to Apache MXNet; repo is deprecated and read-only
Caveats
- Deprecated since ~2017; all development moved to MXNet, which has itself since been retired to the Apache Attic
- mshadow-2.x broke backward compatibility with 1.x, and legacy code needs pinned releases
- README links to Travis CI (RIP) and documentation paths that may be stale
Verdict
Worth studying if you’re building a tensor library or curious about expression-template metaprogramming in C++. Not worth adopting for new work: modern alternatives (xtensor, Kokkos, or just LibTorch, PyTorch’s C++ API) have more oxygen. Historians of the DMLC ecosystem will find the missing link between CXXNet and MXNet.