Chat with images using a frozen LLM and a visual encoder
MiniGPT-4 bolts a vision encoder onto Vicuna or Llama 2 so the LLM can see, describe, and reason about images without retraining the language model itself.

What it does
MiniGPT-4 and its successor MiniGPT-v2 are vision-language models that let you upload an image and ask questions about it, get captions, or request creative tasks like writing poems or stories inspired by what the model sees. The architecture is deliberately simple: a frozen visual encoder (from BLIP-2) feeds into a frozen large language model (Vicuna 7B/13B or Llama 2 Chat 7B) with a lightweight projection layer trained to align the two. The repo includes training and fine-tuning code, evaluation scripts, and local Gradio demos.
The interesting bit
The “mini” framing is slightly cheeky — the LLM itself isn’t mini at all, it’s just not retrained. The actual training budget goes into that thin adapter between vision and text. MiniGPT-v2 extends this to a “unified interface” for multiple vision-language tasks, suggesting the same frozen LLM backbone can handle diverse prompts with task-specific token prefixes rather than separate fine-tuned heads.
Key highlights
- Supports Vicuna 7B/13B and Llama 2 Chat 7B as the language backbone
- Runs locally with ~11.5 GB VRAM for 7B models (8-bit), ~23 GB for 13B
- Includes Colab notebook and Hugging Face Spaces demos for quick testing
- Community has forked it for dermatology diagnosis (SkinGPT-4), art analysis (ArtGPT-4), patent figure captioning (PatFig), and instruction tuning
- BSD 3-Clause license; built on top of Salesforce’s LAVIS framework
Caveats
- Setup is manual: you must download LLM weights separately via git-lfs, download checkpoint files from Google Drive, and edit YAML config paths by hand
- The README contains broken links (“Downlad” typo for Vicuna 13B) and inconsistent line number references that may drift
- No quantitative benchmarks or accuracy claims are stated in the README; performance is left to the demo examples and linked papers
Verdict
Worth a look if you want a hackable, relatively lightweight vision-language baseline with clear training code. Skip it if you need a polished product or plug-and-play API — this is research code with research-code edges.