Is SpeedTorch open source?

Yes — Santosh-Gupta/SpeedTorch is open source, released under the MIT license.

What language is SpeedTorch written in?

Santosh-Gupta/SpeedTorch is primarily written in Python.

How popular is SpeedTorch?

Santosh-Gupta/SpeedTorch has 682 stars on GitHub.

Where can I find SpeedTorch?

Santosh-Gupta/SpeedTorch is on GitHub at https://github.com/Santosh-Gupta/SpeedTorch.

Santosh-Gupta/SpeedTorch

A CuPy trick that outruns PyTorch's own CPU↔GPU transfer on low-core machines

SpeedTorch masquerades CPU tensors as GPU tensors to dodge PyTorch indexing overhead, buying speed on low-core machines at a memory cost.

★ 682 stars · Python · ML Frameworks · Data Tooling

What it does

SpeedTorch is a utility library that wraps CuPy tensors to speed up data transfer between pinned CPU memory and PyTorch GPU tensors. It also provides factory classes that keep embeddings in CPU RAM while they are idle during training, so you can train larger sparse embedding models without exhausting GPU memory.

The interesting bit

The speedup doesn’t come from faster PCIe transfer; it comes from avoiding PyTorch’s CPU indexing kernels entirely. SpeedTorch “masquerades” CPU tensors as GPU tensors so that CuPy’s indexing kernels do the work instead. On 1–2 core systems (hello, free Colab tier), this wins big. On many-core machines, PyTorch’s own indexing catches up and the advantage evaporates.
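For context, the sketch below shows the plain PyTorch path that this trick targets: rows are gathered from a table held in CPU memory, copied to the device, updated, and written back. It does not use SpeedTorch or CuPy; the sizes, the `fetch_rows` and `store_rows` helper names, and the stand-in update are assumptions made for illustration.

```python
import torch

# Assumed sizes for illustration only; real embedding tables are far larger.
NUM_EMBEDDINGS, DIM = 1000, 16

# Fall back to CPU so the sketch still runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The full table lives in CPU RAM; pinning only applies when CUDA is present.
table = torch.randn(NUM_EMBEDDINGS, DIM)
if device.type == "cuda":
    table = table.pin_memory()


def fetch_rows(cpu_table, row_ids):
    """Gather the requested rows on the CPU, then copy them to the device."""
    return cpu_table.index_select(0, row_ids).to(device)


def store_rows(cpu_table, row_ids, rows):
    """Copy updated rows back to the CPU and overwrite them in the table."""
    cpu_table.index_copy_(0, row_ids, rows.to("cpu"))


row_ids = torch.tensor([3, 42, 7])    # rows used by the current step
active = fetch_rows(table, row_ids)   # small working set on the device
active = active + 1.0                 # stand-in for an optimizer update
store_rows(table, row_ids, active)    # the rest of the table never left the CPU
```

The `index_select` call inside `fetch_rows` is the CPU-side indexing step; the copy to the device is a separate step that follows it.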

Key highlights

3.1× faster CPU→GPU transfer and 410× faster GPU→CPU transfer than PyTorch pinned CPU tensors on a 2-core Colab instance (Tesla K80, 131k embeddings × 128 dims)
GPU↔GPU transfers also win because SpeedTorch sidesteps a PyTorch indexing bug present in v1.1/1.2 (fixed in nightly, or avoidable with index_select/index_copy_)
Supports non-sparse optimizers (Adam, Adamax, RMSprop, etc.) for sparse embedding training by keeping all parameters in a dense variable with CPU offloading
Memory tradeoff: CuPy pinned CPU tensors use ~10 GB of CPU memory vs ~6.6 GB for PyTorch pinned tensors when holding a 10M×128 float32 table; the GPU footprint is smaller (0.06 GB vs 0.32 GB)
Requires CuPy to be installed first; SpeedTorch itself is a pip-installable wrapper that provides DataGadget plus factory classes for models and optimizers
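The memory figures in the list are easier to judge against the raw size of the data. The short calculation below is a sketch that uses only the numbers quoted above; the `tensor_bytes` helper is a hypothetical name, and treating GB as 10^9 bytes and "131k" as 131,072 are assumptions about how the figures were reported.

```python
def tensor_bytes(rows, cols, bytes_per_element=4):
    """Raw payload of a dense rows x cols tensor (float32 by default)."""
    return rows * cols * bytes_per_element


GB = 10**9  # assumption: the quoted figures use decimal gigabytes

# 10M x 128 float32 table from the memory comparison.
raw_gb = tensor_bytes(10_000_000, 128) / GB

# Reported CPU-side footprints, taken from the list above.
cupy_cpu_gb, torch_cpu_gb = 10.0, 6.6

print(f"raw payload:          {raw_gb:.2f} GB")
print(f"CuPy pinned / raw:    {cupy_cpu_gb / raw_gb:.2f}x")
print(f"PyTorch pinned / raw: {torch_cpu_gb / raw_gb:.2f}x")

# Benchmark payload: 131k embeddings x 128 dims, assuming 131k means 2**17.
bench_mib = tensor_bytes(131_072, 128) / 2**20
print(f"benchmark payload:    {bench_mib:.0f} MiB")
```

Under these assumptions the raw table is 5.12 GB, so the reported CPU footprints are roughly 1.95 times (CuPy) and 1.29 times (PyTorch) the raw data, and the benchmark moves about 64 MiB per full transfer.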

Caveats

Speed advantage is system-dependent: more CPU cores favor PyTorch; always benchmark on your own hardware with your data sizes
The GPU↔GPU speedup is essentially a workaround for a since-fixed PyTorch bug—modern PyTorch versions or proper indexing syntax may negate the benefit
README contains typos and rough phrasing (“revovles,” “masquarding,” “augment tra”) that suggest limited maintenance
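Since the first caveat recommends benchmarking on your own hardware, here is a minimal timing harness that could be used for that. It is a generic sketch built on the standard library, not something shipped with SpeedTorch; the `benchmark` and `speedup` names and the placeholder workloads are assumptions.

```python
import os
import statistics
import time


def benchmark(fn, repeats=20, warmup=3):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):  # warm caches and lazy initialisation
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(baseline_fn, candidate_fn, **kwargs):
    """How many times faster candidate_fn is than baseline_fn (>1 is faster)."""
    return benchmark(baseline_fn, **kwargs) / benchmark(candidate_fn, **kwargs)


if __name__ == "__main__":
    # Placeholder workloads; swap in your own transfer functions.
    # GPU work is asynchronous, so each real callable should call
    # torch.cuda.synchronize() before returning, or timings will be too low.
    slow = lambda: sum(range(200_000))
    fast = lambda: sum(range(20_000))
    print(f"CPU cores visible: {os.cpu_count()}")
    print(f"speedup: {speedup(slow, fast, repeats=10):.1f}x")
```

Reporting the visible core count alongside the ratio matters here, because the advantage described above depends on how many cores PyTorch's CPU indexing can use.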

Verdict

Worth a look if you’re stuck on low-core hardware, fighting GPU memory limits with huge embeddings, or running older PyTorch where the indexing bug still bites. If you’re on a many-core workstation or a PyTorch version that includes the fix, the gains likely shrink to noise, so just use index_select and move on.
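For readers unfamiliar with the alternative mentioned here, the sketch below shows `torch.index_select` and `Tensor.index_copy_` used in place of bracket indexing for reading and writing rows. The tensor sizes and variable names are assumptions, and the snippet makes no claim about how much faster either form is on a given PyTorch version.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Small assumed sizes so the sketch runs anywhere.
weights = torch.arange(40, dtype=torch.float32, device=device).reshape(10, 4)
idx = torch.tensor([0, 3, 9], device=device)

# Reading rows: explicit gather along dim 0 instead of weights[idx].
rows = torch.index_select(weights, 0, idx)
assert torch.equal(rows, weights[idx])  # same result as bracket indexing

# Writing rows: in-place copy along dim 0 instead of weights[idx] = new_rows.
new_rows = torch.zeros(3, 4, device=device)
weights.index_copy_(0, idx, new_rows)
```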
