
ictnlp/LLaVA-Mini

A large multimodal model that understands images and video using only 1 vision token per image, reducing FLOPs by 77%.

LLaVA-Mini is a unified large multimodal model for efficient image and video understanding. It achieves performance comparable to LLaVA-v1.5 while representing each image with 1 vision token instead of 576, so only about 0.17% of the original vision tokens are kept. The model processes high-resolution images and videos at significantly lower computational cost: a 77% reduction in FLOPs, latency cut from 100 ms to 40 ms, and VRAM usage cut from 360 MB to 0.6 MB per image. Available models include an 8B-parameter version based on Llama 3.1.
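The figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is not part of the LLaVA-Mini codebase: the helper names `ratio_percent` and `reduction_percent` are hypothetical, and the inputs are simply the numbers stated in the description, not new measurements.

```python
# Illustrative arithmetic only. All inputs are the figures quoted in the
# description; the helper names are hypothetical.

def ratio_percent(new: float, old: float) -> float:
    """Return new/old expressed as a percentage."""
    return 100.0 * new / old


def reduction_percent(new: float, old: float) -> float:
    """Return the relative reduction from old to new as a percentage."""
    return 100.0 * (old - new) / old


# Vision tokens per image: 1 (LLaVA-Mini) versus 576 (LLaVA-v1.5).
token_ratio = ratio_percent(1, 576)

# Latency: 40 ms versus 100 ms.
latency_reduction = reduction_percent(40, 100)

# VRAM per image: 0.6 MB versus 360 MB, expressed as a shrink factor.
vram_factor = 360 / 0.6

print(f"tokens kept: {token_ratio:.2f}%")
print(f"latency reduction: {latency_reduction:.0f}%")
print(f"VRAM shrink factor: {vram_factor:.0f}x")
```

Note that the per-image VRAM shrink factor (about 600x) is close to the 576x token reduction, which is consistent with per-image memory growing roughly in proportion to the number of vision tokens. The 77% FLOPs figure is taken from the description and is not derived here.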
