Stonesjtu/pytorch_memlab

Find which line of PyTorch code ate your GPU memory

A line-by-line CUDA memory profiler that shows exactly where your tensors are born, live, and refuse to die.

1.1k stars Python LLMOps · EvalML Frameworks
Star velocity (7d): +0.4 ★ / day · Trend: steady

What it does

pytorch_memlab is a debugging toolkit for PyTorch CUDA out-of-memory errors. It provides a line-by-line memory profiler (inspired by line_profiler), a tensor inspector that digs into underlying UntypedStorage objects, and Jupyter/IPython magic commands for interactive use. There’s even a “courtesy” feature to temporarily shove all CUDA tensors to CPU RAM when you need to free up the GPU for something else.

The interesting bit

The Memory Reporter doesn’t trust Tensor.size. It walks the actual UntypedStorage objects to report real memory usage, and it marks storage sharing with (->) annotations so you can see when multiple tensors point to the same underlying buffer. The profiler, meanwhile, hooks each line of your function and reports peak active vs. reserved bytes: active bytes are memory held by live tensors, while reserved bytes are what the caching allocator has claimed from the device, whether or not a tensor currently occupies it.
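To see why counting storages differs from summing tensor sizes, here is a minimal standard-library analogy. It is not pytorch_memlab's implementation: memoryview objects stand in for tensors, bytearray objects stand in for storages, and the helper names naive_bytes and storage_bytes are made up for this sketch.

```python
def naive_bytes(views):
    """Sum each view's own size; shared buffers get counted repeatedly."""
    return sum(v.nbytes for v in views)


def storage_bytes(views):
    """Count each underlying buffer exactly once, keyed by object identity."""
    seen = {}
    for v in views:
        # v.obj is the buffer the view was created from (the "storage").
        seen[id(v.obj)] = len(v.obj)
    return sum(seen.values())


weights = bytearray(4096)  # stand-in for a shared weight storage
scratch = bytearray(1024)  # stand-in for a larger-than-needed storage

views = [
    memoryview(weights),         # full view of the weights
    memoryview(weights)[:2048],  # slice sharing the same buffer
    memoryview(scratch)[:16],    # tiny view that keeps 1024 bytes alive
]

naive_total = naive_bytes(views)   # 4096 + 2048 + 16
real_total = storage_bytes(views)  # 4096 + 1024
print(naive_total, real_total)
```

The naive sum is wrong in both directions: it double-counts the shared buffer and under-counts the small view that pins a larger allocation. That is the kind of discrepancy a storage-level report is meant to expose.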

Key highlights

  • @profile decorator gives per-line CUDA memory stats (active bytes, reserved bytes, peak usage)
  • MemReporter inspects actual UntypedStorage usage, not surface-level tensor sizes; handles shared weights correctly
  • %mlrun / %%mlrun IPython magics for notebook profiling without code changes
  • set_target_gpu for multi-GPU profiling, though the selection is global state you must manually track
  • “Courtesy” mode: temporarily migrate all CUDA tensors to CPU and back
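The active and reserved columns mentioned above can diverge a lot in practice. The toy model below sketches why, under loud assumptions: ToyCachingAllocator is a hypothetical class written for this page, the fixed 2 MiB segment size is arbitrary, and fragmentation is ignored. It does not reproduce PyTorch's real allocator; in PyTorch itself the comparable counters are exposed by torch.cuda.memory_allocated() and torch.cuda.memory_reserved().

```python
MiB = 1024 * 1024


class ToyCachingAllocator:
    """Toy model: memory is reserved in whole segments and kept after free."""

    def __init__(self, segment_bytes=2 * MiB):
        self.segment_bytes = segment_bytes
        self.active = 0       # bytes held by live "tensors"
        self.reserved = 0     # bytes claimed from the "device"
        self.peak_active = 0  # high-water mark of active bytes

    def alloc(self, nbytes):
        self.active += nbytes
        self.peak_active = max(self.peak_active, self.active)
        # Grow the reservation in whole segments until it covers active use.
        while self.reserved < self.active:
            self.reserved += self.segment_bytes

    def free(self, nbytes):
        # Freed bytes return to the cache, so the reservation does not shrink.
        self.active -= nbytes


allocator = ToyCachingAllocator()
allocator.alloc(3 * MiB)  # needs two 2 MiB segments
allocator.free(3 * MiB)   # tensor dies, reservation stays
allocator.alloc(1 * MiB)  # served from the cached reservation

snapshot = (allocator.active, allocator.reserved, allocator.peak_active)
print([value // MiB for value in snapshot])
```

After the last call the model holds 1 MiB active, 4 MiB reserved, and a 3 MiB peak, which is why a per-line profile that shows both columns is more informative than either number alone.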

Caveats

  • GPU target selection is global mutable state; easy to profile the wrong device by accident
  • README is truncated mid-example for the LSTM case, so the full verbose output isn’t visible
  • No mention of PyTorch 2.x compatibility or torch.compile interaction

Verdict

Worth a look if you’re debugging OOMs in training loops and torch.cuda.memory_summary() isn’t granular enough. Less useful if you already rely on PyTorch’s built-in memory profiling or need multi-GPU tracing with automatic device tracking.
