A Cupy trick that outruns PyTorch's own CPU↔GPU transfer
SpeedTorch masquerades CPU tensors as GPU tensors to dodge PyTorch indexing overhead, buying speed on low-core machines at a memory cost.

What it does
SpeedTorch is a utility library that wraps Cupy tensors to speed up data transfer between pinned CPU memory and PyTorch GPU variables. It also provides factory classes that keep embeddings in CPU RAM while they are idle during training, so you can train larger sparse embedding models without exhausting GPU memory.
The interesting bit
The speedup isn’t from faster PCIe transfer—it’s from avoiding PyTorch’s CPU indexing kernels entirely. SpeedTorch “masquerades” CPU tensors as GPU tensors so Cupy’s indexing kernels handle the work instead. On 1–2 core systems (hello, free Colab tier), this wins big. On many-core machines, PyTorch’s own indexing catches up and the advantage evaporates.
Key highlights
- 3.1× faster CPU→GPU transfer and 410× faster GPU→CPU transfer than PyTorch pinned CPU tensors on a 2-core Colab instance (Tesla K80, 131k embeddings × 128 dims)
- GPU↔GPU transfers also win because SpeedTorch sidesteps a PyTorch indexing bug present in v1.1/1.2 (fixed in nightly, or avoidable with index_select/index_copy_)
- Supports non-sparse optimizers (Adam, Adamax, RMSprop, etc.) for sparse embedding training by keeping all parameters in a dense variable with CPU offloading
- Memory tradeoff: for a 10M×128 float32 table, Cupy pinned CPU tensors use ~10 GB of host memory vs PyTorch's ~6.6 GB; the Cupy path's GPU footprint is smaller (0.06 GB vs 0.32 GB)
- Requires Cupy pre-installed; pip-installable wrapper with DataGadget and factory classes for models/optimizers
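To put the memory figures in context, here is a small sizing sketch in plain Python. It computes only the raw payload of a 10M×128 float32 table and compares it with the host-side numbers quoted above. It assumes those numbers are decimal gigabytes, which the source does not state, and it does not try to explain where the extra memory goes.

```python
def tensor_bytes(rows: int, cols: int, bytes_per_element: int = 4) -> int:
    """Raw payload of a dense rows x cols tensor (float32 = 4 bytes per element)."""
    return rows * cols * bytes_per_element


raw_bytes = tensor_bytes(10_000_000, 128)
raw_gb = raw_bytes / 1e9      # decimal gigabytes
raw_gib = raw_bytes / 2**30   # binary gibibytes

# Host-side figures quoted in the highlights (unit assumed to be decimal GB).
reported_host_gb = {"pytorch_pinned": 6.6, "cupy_pinned": 10.0}

if __name__ == "__main__":
    print(f"raw payload: {raw_gb:.2f} GB ({raw_gib:.2f} GiB)")
    for name, gb in reported_host_gb.items():
        print(f"{name}: {gb:.1f} GB reported, {gb - raw_gb:.2f} GB above raw payload")
```

Swap in your own row count and embedding width to check whether the larger pinned-host footprint fits in RAM before committing to the Cupy path.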
Caveats
- Speed advantage is system-dependent: more CPU cores favor PyTorch; always benchmark on your own hardware with your data sizes
- The GPU↔GPU speedup is essentially a workaround for a since-fixed PyTorch bug—modern PyTorch versions or proper indexing syntax may negate the benefit
- README contains typos and rough phrasing (“revovles,” “masquarding,” “augment tra”) that suggest limited maintenance
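For the "benchmark on your own hardware" advice, a minimal baseline sketch in plain PyTorch follows. It times the CPU-side row gather separately from the host-to-device copy, since the claim above is that the indexing step, not the copy, is what hurts on low-core machines. This is not SpeedTorch's API; the function name and defaults are made up for illustration, and the script falls back to CPU-only timing when no GPU is present.

```python
import time

import torch


def time_gather_and_copy(num_rows=131_072, dim=128, batch=4_096, repeats=20):
    """Average seconds spent on CPU indexing vs. the copy to the target device."""
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    # Pinned memory needs a CUDA runtime, so only request it when a GPU exists.
    table = torch.randn(num_rows, dim, pin_memory=use_cuda)
    idx = torch.randint(0, num_rows, (batch,))

    index_s = 0.0
    copy_s = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        rows = table[idx]        # gather runs in PyTorch's CPU indexing kernel
        t1 = time.perf_counter()
        out = rows.to(device)    # host-to-device copy (a no-op on CPU-only machines)
        if use_cuda:
            torch.cuda.synchronize()  # wait for the copy before stopping the clock
        t2 = time.perf_counter()
        index_s += t1 - t0
        copy_s += t2 - t1

    return {
        "device": str(device),
        "index_s": index_s / repeats,
        "copy_s": copy_s / repeats,
        "shape": tuple(out.shape),
    }


if __name__ == "__main__":
    print(time_gather_and_copy())
```

If index_s dominates copy_s on your machine, you are in the regime where a Cupy-side gather could plausibly help; if the two are comparable or the copy dominates, the advantage described here is less likely to show up.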
Verdict
Worth a look if you’re stuck on low-core hardware, fighting GPU memory limits with huge embeddings, or running older PyTorch where the indexing bug still bites. If you’re on a many-core workstation or current PyTorch nightly, the gains likely shrink to noise—just use index_select and move on.
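For reference, the index_select/index_copy_ route looks roughly like this. It is a toy sketch with an 8×4 stand-in table, not a reproduction of the benchmark setup, and it runs on CPU when no GPU is available.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-in for an embedding table living on the target device.
table = torch.zeros(8, 4, device=device)
idx = torch.tensor([1, 3, 5], device=device)  # int64 indices, as index_copy_ requires

# Gather the selected rows along dim 0 (same values as table[idx]).
rows = torch.index_select(table, 0, idx)

# Stand-in for an optimizer update on the gathered rows.
rows = rows + 1.0

# Write the updated rows back into the table in place.
table.index_copy_(0, idx, rows)
```

Both calls take the dimension explicitly and an int64 index tensor, so the gather and the write-back stay on one device with no fancy-indexing syntax involved.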