jingyaogong/minimind-v

Train a see-and-chat model in two hours on one GPU

A minimal VLM you can train from scratch on one GPU in two hours for the price of a coffee.

★ 8.4k stars · Python · Language Models · Image · Video · Audio

Velocity (7d): +5.1 ★ / day · Trend: ↘ cooling

What it does

MiniMind-V is a family of vision-language models that starts at 65 million parameters. It straps a SigLIP2 image encoder and a compact MLP projector onto the existing MiniMind language backbone, then runs pre-training and supervised fine-tuning to produce a model that can look at an image and talk about it. The repository contains the full stack—data cleaning, model definitions, training scripts, and a WebUI—framed as both a working tiny model and a readable tutorial on VLM internals.

The interesting bit

The project treats extreme minimalism as a feature, not a bug. The authors claim the SFT stage finishes in roughly two hours on a single RTX 3090 for about $3 in cloud rental, achieved largely by freezing the LLM’s middle layers and training only the vision projector plus the first and last transformer layers. It is essentially an experiment in how little compute and parameter mass are needed before image-and-text behavior starts to emerge.
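
The selective-freezing idea described above can be sketched as a small predicate over parameter names. This is only an illustration under stated assumptions: the prefixes `vision_encoder.` and `vision_proj`, the `layers.<index>.` naming pattern, and the helper `is_trainable` are all hypothetical and may not match the names used in the repository.

```python
import re

# Assumed naming scheme: transformer blocks appear as "...layers.<index>....".
LAYER_PATTERN = re.compile(r"\blayers\.(\d+)\.")


def is_trainable(name: str, num_layers: int) -> bool:
    """Decide whether a parameter should receive gradients.

    Trainable: the vision projector, plus the first and last LLM layers.
    Frozen: the vision encoder, the middle LLM layers, and everything else.
    """
    if name.startswith("vision_encoder."):  # hypothetical prefix; encoder stays frozen
        return False
    if "vision_proj" in name:  # hypothetical name for the MLP projector
        return True
    match = LAYER_PATTERN.search(name)
    if match is None:  # embeddings, final norm, output head, etc.
        return False
    return int(match.group(1)) in (0, num_layers - 1)


# Illustrative parameter names for an 8-layer language model.
names = [
    "vision_encoder.encoder.layers.0.self_attn.k_proj.weight",
    "vision_proj.0.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.3.mlp.up_proj.weight",
    "model.layers.7.mlp.down_proj.weight",
]
trainable = [n for n in names if is_trainable(n, num_layers=8)]
```

With a PyTorch model, a predicate like this would typically be applied by iterating over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name, num_layers)`, so the optimizer only updates the small trainable subset.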

Key highlights

Smallest variant is ~1/2600th the size of GPT-3; model sizes range from about 65M up to a 200M MoE variant
Vision pipeline uses SiglipVisionModel (256×256 fixed input) feeding into an MLP projector with LayerNorm
Dataset ships as unified Parquet files (2.9M SFT samples, with the pre-training subset merged in), so there is no need to unpack thousands of loose images
Training supports PyTorch DDP, bfloat16, torch.compile, and resuming across different GPU counts
Released in both raw PyTorch .pth and HuggingFace Transformers formats, with an included WebUI for local inference
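
As a quick sanity check on the size comparison in the first highlight, the arithmetic can be worked through directly. The sketch assumes GPT-3's published 175-billion-parameter count; the helper name `implied_params` is illustrative only.

```python
GPT3_PARAMS = 175e9  # GPT-3's published parameter count


def implied_params(ratio_denominator: float, reference: float = GPT3_PARAMS) -> float:
    """Parameter count implied by a "1/N the size of the reference" claim."""
    return reference / ratio_denominator


smallest = implied_params(2600)           # size implied by "~1/2600th of GPT-3"
ratio_for_65m = GPT3_PARAMS / 65e6        # ratio implied by a 65M-parameter model

print(f"{smallest / 1e6:.1f}M parameters")
print(f"1/{ratio_for_65m:.0f} of GPT-3")
```

A 1/2600 ratio works out to roughly 67M parameters, which lines up with the roughly 65-million-parameter starting size given in the description.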

Caveats

The “two hours / three dollars” figure refers specifically to the SFT stage (one epoch on an RTX 3090), not the complete pipeline or pre-training
It is explicitly positioned as a minimal teaching implementation; expect tutorial-grade performance, not production accuracy
Pre-training is listed as optional, and the README notes you can jump straight to SFT, though it is unclear how much that affects final quality
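
To put the first caveat in perspective, the quoted SFT figures imply an hourly rental rate, which can then be used for a rough estimate of longer runs. This is a back-of-the-envelope sketch: it assumes cost scales linearly with training time, ignores pre-training, storage, and setup time, and the function names are illustrative.

```python
def implied_hourly_rate(total_cost: float, hours: float) -> float:
    """Hourly GPU rental rate implied by a total cost and a duration."""
    return total_cost / hours


def estimated_cost(hours_per_epoch: float, epochs: int, hourly_rate: float) -> float:
    """Rough cost of a run, assuming time (and cost) scale linearly with epochs."""
    return hours_per_epoch * epochs * hourly_rate


# Figures quoted for one SFT epoch on a single RTX 3090: about 2 hours for about $3.
rate = implied_hourly_rate(total_cost=3.0, hours=2.0)
three_epochs = estimated_cost(hours_per_epoch=2.0, epochs=3, hourly_rate=rate)

print(f"implied rate: ${rate:.2f}/hour")
print(f"three SFT epochs: about ${three_epochs:.2f}")
```

Actual costs will depend on the rental provider and on whether the optional pre-training stage is run first.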

Verdict

Grab it if you are a student, researcher, or hobbyist who wants to dissect a VLM without renting a GPU cluster. Look elsewhere if you need commercial-grade vision reasoning; this is a dissectible frog, not a finished product.

Frequently asked

What is jingyaogong/minimind-v?: A minimal VLM you can train from scratch on one GPU in two hours for the price of a coffee.
Is minimind-v open source?: Yes — jingyaogong/minimind-v is open source, released under the Apache-2.0 license.
What language is minimind-v written in?: jingyaogong/minimind-v is primarily written in Python.
How popular is minimind-v?: jingyaogong/minimind-v has 8.4k stars on GitHub and is currently cooling off.
Where can I find minimind-v?: jingyaogong/minimind-v is on GitHub at https://github.com/jingyaogong/minimind-v.