NVlabs/VILA
VILA is a family of open vision language models optimized for video and multi-image understanding tasks.

Velocity (7d): +4.6 ★ / day · Trend: steady
VILA provides a suite of vision language models designed for efficient multimodal AI across edge, data center, and cloud deployments. The project includes models for video understanding, high-resolution image processing, and long-context video analysis. Recent releases include OmniVinci for joint visual-audio understanding, LongVILA for million-token context windows, and NVILA for full-stack efficiency optimization of multimodal model design.