IDEA-Research/DINO-X-API
A state-of-the-art vision model for open-world object detection and understanding that supports text prompts, visual prompts, and multi-level semantic outputs.

DINO-X is a unified vision foundation model that reports state-of-the-art zero-shot detection results on the COCO, LVIS-minival, and LVIS-val benchmarks. It accepts text prompts, visual prompts, and customized prompts as input, and produces bounding boxes, segmentation masks, pose keypoints, and object captions through multiple perception heads. Its focus is open-set detection: recognizing objects beyond a predefined list of categories.
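
To make the multi-head output described above more concrete, here is a minimal Python sketch of how a client might represent one detected object and post-filter results. It is not the DINO-X API schema: the `Detection` type, its field names, the `(x_min, y_min, x_max, y_max)` box convention, and the helpers `bbox_area` and `keep_confident` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical client-side types for illustration only.
# They do NOT mirror the actual DINO-X API response format.


@dataclass
class Detection:
    category: str                             # open-vocabulary label for the object
    score: float                              # confidence, assumed to lie in [0, 1]
    bbox: Tuple[float, float, float, float]   # assumed (x_min, y_min, x_max, y_max) in pixels
    mask: Optional[List[List[int]]] = None    # binary mask rows, if a mask was requested
    keypoints: List[Tuple[float, float]] = field(default_factory=list)  # pose keypoints, if any
    caption: Optional[str] = None             # object caption, if one was requested


def bbox_area(det: Detection) -> float:
    """Return the box area in square pixels (0 for degenerate boxes)."""
    x_min, y_min, x_max, y_max = det.bbox
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def keep_confident(dets: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep detections whose score is at least `threshold`, preserving order."""
    return [d for d in dets if d.score >= threshold]


if __name__ == "__main__":
    results = [
        Detection("helmet", 0.9, (0.0, 0.0, 10.0, 20.0), caption="a red helmet"),
        Detection("wheel", 0.3, (5.0, 5.0, 6.0, 6.0)),
    ]
    for det in keep_confident(results):
        print(det.category, det.score, bbox_area(det))
```

Keeping the mask, keypoints, and caption optional reflects the idea that each perception head contributes its own output, so a given result may carry only a subset of them.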