Is MPP-LLaVA open source?

Yes — Coobiw/MPP-LLaVA is an open-source project tracked on heatdrop.

What language is MPP-LLaVA written in?

GitHub lists Jupyter Notebook as the primary language of Coobiw/MPP-LLaVA.

How popular is MPP-LLaVA?

Coobiw/MPP-LLaVA has 684 stars on GitHub.

Where can I find MPP-LLaVA?

Coobiw/MPP-LLaVA is on GitHub at https://github.com/Coobiw/MPP-LLaVA.

Coobiw/MPP-LLaVA

Pipeline parallelism for the GPU-poor: multimodal Qwen on RTX 4090s

It squeezes 8B and 14B multimodal model training onto RTX 3090/4090 cards using DeepSpeed pipeline parallelism.

★ 684 stars · Jupyter Notebook · ML Frameworks · Language Models · Image · Video · Audio

What it does

MPP-LLaVA is a personal training and inference stack for Qwen-based multimodal large language models that handle images, video, and multi-turn dialogue. It follows the LLaVA two-stage recipe (first pretrain a projection layer, then run supervised fine-tuning on the full language model) and wraps the process in DeepSpeed pipeline parallelism so that it fits on consumer hardware. Under the hood it is largely glue code: Salesforce LAVIS provides the BLIP-2 vision backbone, QwenLM provides the language model, and LLaVA and VideoChatGPT data provide the supervision.
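
The two-stage recipe can be summarized as a table of which components receive gradient updates in each stage. The sketch below is illustrative only: the component names are hypothetical, and keeping the vision encoder frozen throughout and the projection layer trainable during SFT are assumptions borrowed from the usual LLaVA setup, not details confirmed for this repository.

```python
# Which parts of the model are trained in each stage.
# Assumption: the vision encoder stays frozen in both stages, and the
# projection layer keeps training during SFT. Names are hypothetical.
STAGES = {
    "pretrain": {"vision_encoder": False, "projection": True, "language_model": False},
    "sft": {"vision_encoder": False, "projection": True, "language_model": True},
}


def trainable_components(stage: str) -> list[str]:
    """Return the names of the components that are updated in the given stage."""
    return sorted(name for name, trainable in STAGES[stage].items() if trainable)


if __name__ == "__main__":
    for stage in STAGES:
        print(stage, "->", trainable_components(stage))
```

In a real training script the same table would typically drive which parameters have gradients enabled before the optimizer is built.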

The interesting bit

The key trick is splitting the model's layers across 24 GB cards with pipeline parallelism, so a training job that would normally call for datacenter GPUs can run on a handful of RTX 3090 or 4090 cards. The author also notes an emergent capability: after video fine-tuning, the model gains multi-image comparison skills despite never seeing dedicated multi-image training data.
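
To see why splitting by layers helps, here is a rough back-of-the-envelope sketch. It assumes 2 bytes per parameter (fp16 or bf16 weights) and a hypothetical model of 40 equally sized layers totalling 14B parameters; neither figure comes from the repository, and the partitioning function is a simple illustration rather than DeepSpeed's own partitioner. Gradients, optimizer state, and activations add further memory on top of the weights.

```python
def weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GiB (2 bytes/param assumes fp16 or bf16)."""
    return num_params * bytes_per_param / 2**30


def partition_layers(layer_params: list[int], num_stages: int) -> list[list[int]]:
    """Split layers into contiguous stages with roughly equal parameter counts."""
    total = sum(layer_params)
    stages: list[list[int]] = [[] for _ in range(num_stages)]
    running = 0
    for idx, n in enumerate(layer_params):
        # Assign each layer to the stage that its parameter midpoint falls into.
        midpoint = running + n / 2
        stage = min(int(midpoint * num_stages / total), num_stages - 1)
        stages[stage].append(idx)
        running += n
    return stages


if __name__ == "__main__":
    layers = [350_000_000] * 40  # hypothetical: 40 layers, 14B parameters in total
    print(f"whole model: {weight_gib(sum(layers)):.1f} GiB of weights")
    for i, stage in enumerate(partition_layers(layers, num_stages=4)):
        params = sum(layers[j] for j in stage)
        print(f"stage {i}: {len(stage)} layers, {weight_gib(params):.1f} GiB of weights")
```

Under these assumptions the full set of weights exceeds a single 24 GB card, while each of four pipeline stages holds only a quarter of them.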

Key highlights

Trains 8B/14B multimodal models on RTX 3090/4090 GPUs via DeepSpeed pipeline parallelism (PP+DP)
Supports image Q&A, multi-turn dialogue, and video conversations; multi-image understanding emerged after video SFT without specific multi-image training
Built atop LAVIS and QwenLM, using BLIP-2 vision components and LLaVA’s pretrain-then-SFT paradigm
Released SFT weights (~15 GB) and preprocessed datasets, currently hosted on ModelScope and Baidu Netdisk
Offers both CLI and Gradio demos, with multi-GPU inference via device_map="auto"
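
The PP+DP combination in the first highlight means the GPUs are arranged in a grid: each pipeline holds one full copy of the model split into stages, and several such pipelines train on different data in parallel. A minimal sketch of that arrangement follows. The rank numbering (replicas of the same stage on adjacent ranks) is an assumption made for illustration, and `build_grid` is a hypothetical helper, not part of DeepSpeed or this repository.

```python
def build_grid(world_size: int, pp_size: int) -> tuple[list[list[int]], list[list[int]]]:
    """Arrange GPU ranks into pipelines (model split) and data-parallel groups.

    Assumed numbering for illustration: rank = stage * dp_size + replica.
    """
    if world_size % pp_size != 0:
        raise ValueError("world_size must be divisible by pp_size")
    dp_size = world_size // pp_size
    # Each pipeline is one full model copy spread over pp_size GPUs.
    pipelines = [
        [stage * dp_size + replica for stage in range(pp_size)]
        for replica in range(dp_size)
    ]
    # Each data-parallel group holds the same stage and averages its gradients.
    dp_groups = [
        [stage * dp_size + replica for replica in range(dp_size)]
        for stage in range(pp_size)
    ]
    return pipelines, dp_groups


if __name__ == "__main__":
    pipelines, dp_groups = build_grid(world_size=8, pp_size=4)
    print("pipelines:", pipelines)
    print("data-parallel groups:", dp_groups)
```

With eight GPUs and four pipeline stages, this yields two pipelines of four GPUs each and four data-parallel groups of two GPUs each.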

Caveats

Hugging Face transformers integration is still on the TODO list, so don’t expect a standard HF workflow yet
Weights and datasets are distributed through ModelScope and Baidu Netdisk, which may be less convenient depending on your region
The README warns that absolute paths are sometimes required to avoid file-path errors, suggesting the tooling can be finicky

Verdict

Grab this if you want to experiment with training large multimodal models on local consumer hardware and don’t mind wrestling with a patchwork of upstream tools. Pass if you need a polished, standalone framework with Hugging Face ecosystem integration.
