Coobiw/MPP-LLaVA
A multimodal pipeline parallel training framework for Qwen-based large language models supporting image, video, and multi-image inputs.

This repository provides a distributed training system for multimodal large language models built on Qwen-LM. It makes fine-tuning 8B/14B models feasible on consumer GPUs with 24 GB of memory, such as the RTX 3090 and RTX 4090. Training combines pipeline parallelism (PP) with data parallelism (DP) through DeepSpeed and supports supervised fine-tuning on image, video, and multi-image conversational data. LLaVA-like multimodal LLMs can therefore be trained without enterprise-grade hardware.
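
To make the PP + DP combination concrete, here is a minimal pure-Python sketch of how a set of GPUs might be arranged into pipeline stages and data-parallel replicas. It is an illustration only, not the repository's code or DeepSpeed's API: the helper names `partition_layers` and `rank_grid`, the uniform layer split, the rank ordering, and the example figures (40 layers, 8 GPUs, 4 stages) are all assumptions made for this example.

```python
def partition_layers(num_layers, num_stages):
    """Split layer indices into contiguous, near-equal pipeline stages.

    Hypothetical helper: a uniform split is assumed here; a real framework
    may balance stages by parameter count or another heuristic.
    """
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # The first `extra` stages each take one additional layer.
        size = base + (1 if stage < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages


def rank_grid(world_size, num_stages):
    """Arrange ranks as a (stage x replica) grid.

    Each row holds the ranks that own the same pipeline stage, i.e. one
    data-parallel group. The ordering is an assumption for illustration.
    """
    if world_size % num_stages != 0:
        raise ValueError("world_size must be divisible by num_stages")
    dp_degree = world_size // num_stages
    return [
        [stage * dp_degree + replica for replica in range(dp_degree)]
        for stage in range(num_stages)
    ]


if __name__ == "__main__":
    # Assumed example: 40 transformer layers on 8 GPUs with 4 pipeline stages.
    stages = partition_layers(num_layers=40, num_stages=4)
    grid = rank_grid(world_size=8, num_stages=4)
    for stage_id, (layers, ranks) in enumerate(zip(stages, grid)):
        print(f"stage {stage_id}: layers {layers[0]}-{layers[-1]} on ranks {ranks}")
```

Under this layout each GPU stores only the layers of its own stage, which is why pipeline parallelism lowers per-GPU memory for model weights, while the replicas within a stage process different batches in parallel.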