
FoundationVision/Groma

Groma is a multimodal LLM that uses localized visual tokenization to enable region-level understanding and visual grounding capabilities.

Velocity (7d): +0.8 ★/day · Trend: steady

Groma is a grounded multimodal large language model built around localized visual tokenization: the image is broken into region-level tokens, so the model can accept user-defined region inputs (bounding boxes) and generate responses grounded to specific visual regions. It is reported to achieve state-of-the-art performance on the referring expression comprehension benchmarks RefCOCO, RefCOCO+, and RefCOCOg. The model is based on the LLaMA architecture, extended with multimodal capabilities for vision-language understanding and localization.
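For readers unfamiliar with how referring expression comprehension is usually scored, the sketch below may help. It is not taken from the Groma repository. It assumes the common convention that a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5, and that boxes are given as `(x1, y1, x2, y2)` corner coordinates. The names `box_iou` and `rec_accuracy` are hypothetical, chosen for illustration only.

```python
from typing import Sequence

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Sequence[float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(preds: Sequence[Box], gts: Sequence[Box],
                 threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    meets the threshold (0.5 is an assumed, commonly used value)."""
    if not preds:
        return 0.0
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

With one exact match and one poorly overlapping prediction, `rec_accuracy` returns 0.5 under these assumptions.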
