BAAI-DCAI/Bunny
A family of lightweight multimodal models combining vision encoders with language backbones like Llama-3 and Phi-3.

Velocity (7d): +1.2 ★/day · Trend: steady
Bunny is a family of lightweight multimodal language models that pair plug-and-play vision encoders (EVA-CLIP, SigLIP) with language backbones (Llama-3-8B, Phi-3-mini, StableLM-2, Qwen1.5, and others). The models accept high-resolution image input up to 1152x1152 and aim to be competitive with larger MLLMs at a smaller parameter count. To compensate for the reduced model size, the training data is curated from a broad range of sources.
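
To make the "plug-and-play" pairing concrete, here is a minimal Python sketch of how one encoder and one backbone might be combined into a single configuration. It is not Bunny's actual API: `MultimodalConfig`, `make_config`, and `tiles_per_side` are hypothetical names, and the 384-pixel default tile size is an assumption chosen only because 1152 = 3 x 384. Only the component names and the 1152x1152 limit come from the description above.

```python
from dataclasses import dataclass

# Component names taken from the description; these are labels, not loaders.
VISION_ENCODERS = ("EVA-CLIP", "SigLIP")
LANGUAGE_BACKBONES = ("Llama-3-8B", "Phi-3-mini", "StableLM-2", "Qwen1.5")

MAX_IMAGE_SIDE = 1152  # maximum supported input side, as stated in the description


@dataclass(frozen=True)
class MultimodalConfig:
    """Hypothetical config pairing one vision encoder with one language backbone."""
    vision_encoder: str
    language_backbone: str
    image_side: int


def make_config(vision_encoder: str, language_backbone: str,
                image_side: int = 384) -> MultimodalConfig:
    """Validate the pairing and return a config (384 is an assumed default)."""
    if vision_encoder not in VISION_ENCODERS:
        raise ValueError(f"unknown vision encoder: {vision_encoder}")
    if language_backbone not in LANGUAGE_BACKBONES:
        raise ValueError(f"unknown language backbone: {language_backbone}")
    if not 0 < image_side <= MAX_IMAGE_SIDE:
        raise ValueError(f"image_side must be in 1..{MAX_IMAGE_SIDE}")
    return MultimodalConfig(vision_encoder, language_backbone, image_side)


def tiles_per_side(image_side: int, tile_side: int = 384) -> int:
    """Tiles needed along one side if the image is cut into square tiles.

    Tiling is an assumption for illustration; the source does not say how
    high-resolution input is handled.
    """
    return -(-image_side // tile_side)  # ceiling division


if __name__ == "__main__":
    cfg = make_config("SigLIP", "Phi-3-mini", image_side=1152)
    print(cfg)
    print(tiles_per_side(cfg.image_side))
```

Because the encoder and backbone are independent fields, swapping either one is a single-argument change, which is the property the "plug-and-play" wording suggests.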