ictnlp/LLaVA-Mini
A large multimodal model that understands images and video using only 1 vision token per image, reducing FLOPs by 77%.

LLaVA-Mini is a unified large multimodal model designed for efficient image and video understanding. It achieves performance comparable to LLaVA-v1.5 while using only 1 vision token per image instead of 576, which is about 0.17% of the original token count. The model processes high-resolution images and videos at significantly reduced computational cost: FLOPs are reduced by 77%, latency drops from 100 ms to 40 ms, and VRAM usage falls from 360 MB to 0.6 MB per image. Available models include an 8B-parameter version based on Llama 3.1.
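
The efficiency figures above can be sanity-checked with simple arithmetic. The sketch below is not part of the LLaVA-Mini codebase and does not call the model. It only recomputes the ratios from the numbers quoted in this description, and the constant and function names are illustrative assumptions.

```python
# Figures quoted in the description above; all names here are illustrative.
TOKENS_BASELINE = 576        # vision tokens per image in LLaVA-v1.5
TOKENS_MINI = 1              # vision tokens per image in LLaVA-Mini
LATENCY_MS = (100.0, 40.0)   # (before, after) latency in milliseconds
VRAM_MB = (360.0, 0.6)       # (before, after) VRAM per image in megabytes


def percent_remaining(before: float, after: float) -> float:
    """Return `after` as a percentage of `before`."""
    return 100.0 * after / before


def reduction_factor(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after


# Share of vision tokens kept after compression.
token_pct = percent_remaining(TOKENS_BASELINE, TOKENS_MINI)
# Latency and memory improvements expressed as multiplicative factors.
latency_speedup = reduction_factor(*LATENCY_MS)
vram_factor = reduction_factor(*VRAM_MB)

print(f"vision tokens kept: {token_pct:.2f}%")
print(f"latency speedup:    {latency_speedup:.1f}x")
print(f"VRAM reduction:     {vram_factor:.0f}x")
```

With the quoted figures, this works out to roughly 0.17% of the vision tokens, a 2.5x latency improvement, and about 600x less VRAM per image.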