
jshilong/GPT4RoI

A vision-language model that enables large language models to understand and reason about spatial regions within images through instruction tuning.

555 stars · Python · Language Models · Computer Vision
Velocity · 7d: +0.5 ★ / day
Trend: steady

GPT4RoI instruction-tunes LLaMA to process region-of-interest inputs alongside natural language instructions, enabling spatial visual understanding. The model accepts bounding box coordinates and cropped image features as input, so users can query specific regions of an image in natural language. The 7B weights are released as delta weights that are combined with the original LLaMA weights, and the project includes training code, inference code, and a Gradio demo.
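The delta-weight release can be illustrated with a toy sketch: a delta checkpoint stores the difference from the base model, and the usable weights are recovered by adding the two parameter by parameter. Everything below is an assumption for illustration, not the project's actual script: the `apply_delta` function, the parameter names, and the choice to copy parameters that exist only in the delta are all hypothetical, and plain Python lists stand in for real tensors.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def apply_delta(base: Weights, delta: Weights) -> Weights:
    """Return base + delta, parameter by parameter (hypothetical helper)."""
    merged: Weights = {}
    for name, delta_values in delta.items():
        if name not in base:
            # Assumption: parameters present only in the delta (for example,
            # newly added layers) are copied over unchanged.
            merged[name] = list(delta_values)
            continue
        base_values = base[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Hypothetical parameter names and values, chosen only for the example.
base = {"layer.weight": [1.0, 2.0, 3.0]}
delta = {"layer.weight": [0.5, -1.0, 0.0], "region_head.weight": [0.25]}
merged = apply_delta(base, delta)
print(merged)
```

In practice the same addition would be done over real checkpoint tensors; the repository's own instructions should be followed for the actual conversion.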
