jshilong/GPT4RoI
A vision-language model that enables large language models to understand and reason about spatial regions within images through instruction tuning.

GPT4RoI instruction-tunes LLaMA to process region-of-interest inputs alongside natural language instructions, enabling spatial visual understanding. The model accepts bounding box coordinates and cropped image features as input, so users can query specific regions of an image in natural language. The 7B weights are released as delta weights that must be combined with the original LLaMA weights, and the project includes training code, inference code, and a Gradio demo.
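
The delta-weight step can be illustrated with a small sketch. It assumes the common convention that a delta checkpoint stores the element-wise difference between the fine-tuned and base parameters, so merging is a per-parameter addition. The `apply_delta` function, the `Weights` type, and the parameter names are hypothetical and use plain Python lists in place of real tensors; the project's own merging script may work differently.

```python
from typing import Dict, List

# Hypothetical flat representation of a checkpoint: parameter name -> values.
Weights = Dict[str, List[float]]


def apply_delta(base: Weights, delta: Weights) -> Weights:
    """Return base + delta for every parameter, element by element.

    Assumption: the delta was produced as (fine-tuned - base), so adding
    it back to the base recovers the fine-tuned parameters.
    """
    if base.keys() != delta.keys():
        raise KeyError("base and delta must contain the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Toy example with made-up parameter names and values.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
delta = {"layer.weight": [0.25, -0.5], "layer.bias": [0.0]}
merged = apply_delta(base, delta)
```

Distributing only the delta means the base LLaMA parameters are not redistributed; each user supplies their own copy of the base weights and reconstructs the full model locally.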