artidoro/qlora

Fine-tune a 65B model on one GPU without selling your house

QLoRA squeezes giant language models onto a single GPU by backpropagating through frozen 4-bit weights into LoRA adapters.

10.9k stars · Jupyter Notebook · Language Models, ML Frameworks
Velocity (7d): +9.7 ★/day · Trend: steady

What it does

QLoRA lets you fine-tune massive language models (up to 65B parameters) on a single 48GB GPU by keeping the base model frozen and quantized to 4 bits, then training small Low-Rank Adapter (LoRA) layers on top. Only the adapters receive gradient updates; the 4-bit weights are dequantized on the fly for each forward and backward pass. It builds on the bitsandbytes quantization library and plugs into Hugging Face’s PEFT and transformers stacks. The repo includes training scripts, Colab notebooks, and the Guanaco model weights fine-tuned with this method.
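
The frozen-base-plus-adapter idea can be sketched in a few lines of plain Python. This is an illustrative toy, not code from the repo: `LoRALinear`, `matvec`, and the sizes used are hypothetical names and values, and the base weight here is kept in ordinary floats rather than 4 bits. It shows the standard LoRA formulation, where the output is the frozen projection plus a scaled low-rank update, and the update starts at zero.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = random.Random(seed)
        # Frozen base weight (QLoRA would store this quantized to 4 bits).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable adapters: A starts random and B starts at zero,
        # so the adapter contributes nothing before training.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        update = matvec(self.B, matvec(self.A, x))  # low-rank path
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])

layer = LoRALinear(d_in=64, d_out=64, r=4)
x = [1.0] * 64
print(layer.trainable_params(), layer.frozen_params())  # 512 4096
```

Even at this toy size the adapters hold an eighth as many parameters as the frozen matrix, and the ratio shrinks further as the layer width grows relative to the rank `r`.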

The interesting bit

The trick is a stack of memory hacks that sound absurd but apparently work: a custom 4-bit “NormalFloat” (NF4) data type that is information-theoretically optimal for normally distributed weights, double quantization (quantizing the quantization constants themselves), and paged optimizers that spill optimizer state to CPU memory when GPU memory spikes. The authors report that this matches full 16-bit fine-tuning performance while cutting memory enough to train models that would otherwise need a multi-GPU setup.
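
To make the data-type idea concrete, here is a simplified sketch of block-wise 4-bit quantization with a codebook built from quantiles of a standard normal distribution. It is an assumption-laden illustration, not the bitsandbytes implementation: the codebook below is a plain symmetric quantile table rather than the exact NF4 values, the block size of 64 is taken from the QLoRA paper, and all function names are hypothetical. Double quantization and paged optimizers are not covered here.

```python
from statistics import NormalDist
import random

def normal_quantile_codebook(bits=4):
    """Evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].
    A simplified stand-in, NOT the exact NF4 table used by bitsandbytes."""
    n = 2 ** bits
    nd = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at 0 and 1.
    qs = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]

def quantize_block(block, codebook):
    """Absmax-scale one block to [-1, 1] and keep nearest-codebook indices."""
    absmax = max(abs(w) for w in block) or 1.0  # guard against all-zero blocks
    idx = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - w / absmax))
        for w in block
    ]
    return idx, absmax

def dequantize_block(idx, absmax, codebook):
    """Look up each 4-bit index and undo the absmax scaling."""
    return [codebook[i] * absmax for i in idx]

def quantize(weights, codebook, block_size=64):
    """Split weights into blocks; each block gets its own scaling constant."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [quantize_block(b, codebook) for b in blocks]

rng = random.Random(0)
weights = [rng.gauss(0, 0.02) for _ in range(256)]
codebook = normal_quantile_codebook()
packed = quantize(weights, codebook)
restored = [w for idx, absmax in packed for w in dequantize_block(idx, absmax, codebook)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_err:.5f}")
```

Because the codebook levels are denser near zero, where normally distributed weights cluster, most values land close to a level; the per-block `absmax` constants are what double quantization later compresses.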

Key highlights

  • Fine-tune 65B models on one 48GB GPU; 7B/13B models run in free Colab tiers
  • Ships with Guanaco model family (7B–65B) trained on OpenAssistant data, plus evaluation scripts using GPT-4 and human ratings
  • Supports LLaMA, LLaMA 2, and T5; multi-GPU training via Accelerate with device_map='auto'
  • Includes inference and fine-tuning Colab notebooks, Gradio demo hosting, and reproduction scripts for Guanaco hyperparameters
  • MIT-licensed code; Guanaco weights require LLaMA license compliance
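
A rough back-of-the-envelope calculation shows why a 65B model can fit on a 48GB card. The sketch below counts only the base weights and their quantization constants; adapters, activations, and optimizer state are extra. The block sizes (64 weights per constant, 256 constants per second-level constant) and the 8-bit second-level constants are assumptions taken from the QLoRA paper, and the function names are hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

def constant_overhead_bits(block_size=64, double_quant=True, dq_block_size=256):
    """Extra bits per parameter spent on per-block quantization constants."""
    if not double_quant:
        return 32 / block_size  # one 32-bit constant per block of weights
    # 8-bit constant per block, plus one 32-bit constant per group of constants.
    return 8 / block_size + 32 / (block_size * dq_block_size)

n = 65e9  # nominal parameter count
fp16 = weight_memory_gb(n, 16)
nf4 = weight_memory_gb(n, 4 + constant_overhead_bits(double_quant=False))
nf4_dq = weight_memory_gb(n, 4 + constant_overhead_bits(double_quant=True))

for name, gb in [("16-bit", fp16), ("4-bit", nf4), ("4-bit + double quant", nf4_dq)]:
    print(f"{name:>22}: {gb:6.1f} GB")
```

Under these assumptions the weights drop from 130 GB at 16 bits to roughly 33.5 GB, which leaves headroom on a 48GB GPU for the adapters and activations.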

Caveats

  • 4-bit inference is currently slow because it is not yet integrated with optimized 4-bit matrix multiplication kernels
  • fp16 compute dtype can destabilize training (only ~80% of 7B LLaMA runs complete without error); bfloat16 as the compute dtype and nf4 as the quantization type are recommended
  • Resuming a LoRA training run is not currently supported by the Hugging Face Trainer
  • Adding new tokens requires manually updating the embeddings and a save/reload workaround
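
The compute-dtype caveat maps onto a handful of loader settings. The sketch below is one way to express them, assuming the `BitsAndBytesConfig` class from Hugging Face transformers; `qlora_settings` and `build_quant_config` are hypothetical helper names, and the heavy imports are deferred so the settings can be inspected without a GPU stack installed.

```python
def qlora_settings(compute_dtype="bfloat16"):
    """Plain-dict settings; keys mirror transformers.BitsAndBytesConfig fields."""
    return {
        "load_in_4bit": True,                # store base weights in 4 bits
        "bnb_4bit_quant_type": "nf4",        # NormalFloat rather than plain fp4
        "bnb_4bit_use_double_quant": True,   # quantize the quantization constants
        "bnb_4bit_compute_dtype": compute_dtype,  # dtype used for matmuls
    }

def build_quant_config(settings):
    """Turn the dict into a BitsAndBytesConfig (requires torch and transformers)."""
    import torch
    from transformers import BitsAndBytesConfig

    kwargs = dict(settings)
    # Map the dtype name (e.g. "bfloat16") to the torch dtype object.
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return BitsAndBytesConfig(**kwargs)

settings = qlora_settings()
print(settings)
```

With torch and transformers installed, the result of `build_quant_config(settings)` would typically be passed as `quantization_config=` to `AutoModelForCausalLM.from_pretrained`, alongside `device_map='auto'`. Passing `compute_dtype="float16"` reproduces the configuration the caveat above warns about.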

Verdict

Researchers and practitioners who need to fine-tune large models on limited hardware should grab this. If you already have an A100 cluster or only need inference, the rough edges around 4-bit speed and stability make it less compelling, though the released Guanaco weights and evaluation tools are still useful for benchmarking.
