FoundationVision/Groma
Groma is a multimodal LLM that uses localized visual tokenization to enable region-level understanding and visual grounding capabilities.

Groma is a grounded multimodal large language model (MLLM) that builds localization into visual tokenization. This lets it accept user-defined region inputs (bounding boxes) and generate responses grounded to specific visual regions. It reports state-of-the-art performance on referring expression comprehension benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg. The model is based on the LLaMA architecture, extended with multimodal capabilities for vision-language understanding and localization.
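
To make the benchmark claim concrete, the sketch below shows how referring expression comprehension is commonly scored: a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box reaches a threshold, often 0.5. This is an illustration only, not Groma's evaluation code. The names `box_iou` and `rec_accuracy`, the `(x1, y1, x2, y2)` box layout, and the 0.5 default are assumptions made for the example.

```python
from typing import Sequence, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) with x1 <= x2, y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the target is >= threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    # Hypothetical boxes, one prediction per referring expression.
    predicted = [(10.0, 10.0, 50.0, 50.0), (0.0, 0.0, 20.0, 20.0)]
    ground_truth = [(12.0, 12.0, 52.0, 52.0), (40.0, 40.0, 80.0, 80.0)]
    print(rec_accuracy(predicted, ground_truth))
```

The same `(x1, y1, x2, y2)` tuple could also stand in for a user-defined region input, although the format Groma itself expects for region prompts is not specified here.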