Before You Download 14 GB of Weights, Check Your VRAM
A Python CLI that estimates whether a Hugging Face model will actually fine-tune on your local GPU before you burn disk space and patience.

What it does
canifinetune is a Python CLI that estimates VRAM consumption for fine-tuning open-weight LLMs with LoRA or QLoRA on a single consumer GPU. It breaks down memory into weights, trainable parameters, gradients, optimizer states, activations, and a safety margin for CUDA fragmentation, then renders a feasibility verdict with a confidence level. If the config does not fit, it suggests concrete downgrades like lower sequence length or rank.
The interesting bit
Unlike accelerate estimate-memory, which only models model loading, this tries to capture the messier reality of training—including bitsandbytes unpacking buffers and activation scaling—then closes the gap with local micro-benchmarks and calibration. The README openly admits its static estimates can be wrong (one example shows 3.16 GB predicted vs 7.10 GB measured on an RTX 4080), and treats those discrepancies as the reason to run real smoke tests rather than a bug.
Key highlights
- Static estimates work without PyTorch installed; the core package is deliberately lightweight.
- Models training memory, not just inference loading: activations scale with
seq_len,batch_size, and layer count, and it accounts for gradient checkpointing toggles. - Includes real RTX 4080 baselines and a
benchcommand that runs tiny training steps on your actual hardware to ground-truth the math. - Generates ready-to-run Hugging Face + PEFT + TRL recipes once you settle on a feasible config.
- Marks every estimate with explicit assumptions and a confidence level, so you know when the model is guessing.
Caveats
- Scope is currently locked to single-GPU, single-node, causal LMs with LoRA/QLoRA; multi-GPU setups like DeepSpeed ZeRO or FSDP are on the uncommitted roadmap.
- The tool is upfront that activation memory is hard to predict statically, so estimates without local calibration should be treated as directional, not gospel.
Verdict
Worth a look if you are fine-tuning open LLMs on a 12–24 GB consumer card and tired of discovering OOMs after lengthy downloads. Skip it if you are already running multi-GPU training clusters with dedicated infrastructure.