NVlabs/describe-anything
A large multimodal model that generates detailed captions for arbitrary regions of images or video frames.

Velocity (7d): +3.5 ★/day · Trend: steady
Describe Anything Model (DAM) takes region annotations (points, boxes, scribbles, or masks) on images or video frames and outputs detailed textual descriptions of those regions. For videos, an annotation on a single frame suffices. The project also includes a new evaluation benchmark, DLC-Bench, for assessing models on the detailed localized captioning task.
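To make the region annotation types concrete, here is a minimal sketch of how a box or a set of points/scribble samples could be rasterized into a binary mask, one common way to put different prompt types on equal footing. This is an illustration only and not the DAM API: the function names `box_to_mask`, `points_to_mask`, and `mask_area`, the end-exclusive box convention, and the `radius` parameter are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # row-major binary mask: mask[y][x] is 0 or 1


def box_to_mask(height: int, width: int, box: Tuple[int, int, int, int]) -> Mask:
    """Rasterize an (x0, y0, x1, y1) box into a binary mask.

    Assumption for this sketch: x1 and y1 are exclusive, and the box is
    clamped to the image bounds.
    """
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (y0 <= y < y1 and x0 <= x < x1) else 0 for x in range(width)]
        for y in range(height)
    ]


def points_to_mask(
    height: int, width: int, points: Sequence[Tuple[int, int]], radius: int = 1
) -> Mask:
    """Mark a square neighborhood around each (x, y) point.

    A scribble can be treated as a dense sequence of such points.
    """
    mask = [[0] * width for _ in range(height)]
    for px, py in points:
        for y in range(max(0, py - radius), min(height, py + radius + 1)):
            for x in range(max(0, px - radius), min(width, px + radius + 1)):
                mask[y][x] = 1
    return mask


def mask_area(mask: Mask) -> int:
    """Count the pixels that belong to the region."""
    return sum(sum(row) for row in mask)
```

A mask built this way, paired with the image or the annotated video frame, is the kind of region input a localized captioning model would consume; how DAM itself represents prompts internally is not stated here.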