← all repositories
haotian-liu/LLaVA

Teaching LLMs to see without billion-dollar budgets

An open-source vision-language model that trains in a day and runs on modest hardware.

LLaVA
Velocity · 7d
+22
★ / day
Trend
steady
star history

What it does

LLaVA bolts a vision encoder onto large language models (LLaMA, Qwen, and others) so the resulting model can chat about images, answer visual questions, and follow complex instructions involving what’s on screen. The project ships training code, inference scripts, and a model zoo with checkpoints from 7B to 110B parameters.

The interesting bit

The core trick is visual instruction tuning: using GPT-4 to generate multimodal instruction-following data from image captions, then training the whole stack end-to-end. The original LLaVA-1.5 reportedly trains in ~1 day on a single 8×A100 node using only public data, yet matches or beats models trained on billion-scale datasets. A later variant, LLaVA-NeXT, processes 4× more pixels and apparently outperforms Gemini Pro on some benchmarks.

Key highlights

  • Supports LoRA fine-tuning with “comparable performance as full-model finetuning” and lower GPU RAM requirements
  • 4-bit/5-bit quantization via llama.cpp; community reports running 13B models on 12 GB VRAM
  • Zero-shot video understanding in LLaVA-NeXT despite image-only training
  • RLHF-tuned variants (LLaVA-RLHF) for reduced hallucination
  • Extensive ecosystem: Colab notebooks, HuggingFace Spaces, AutoGen integration, biomedical spinoff (LLaVA-Med)

Caveats

  • The README warns non-Linux users off the default install path; macOS and Windows require separate docs
  • License stack is complicated: Apache 2.0 for the code, but model checkpoints inherit Llama/Qwen/OpenAI dataset terms, so commercial use depends on which base model you pick

Verdict

Worth a look if you need an open, hackable vision-language model you can actually train and deploy without a corporate cluster. Skip it if you want a polished API product with clean liability lines.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.