shenyunhang/APE
A foundational vision-language model for universal visual perception that detects and segments objects using natural language descriptions across diverse domains.

APE aligns visual and language representations through an innovative prompting mechanism to enable universal visual perception. The model performs open-world object detection, instance segmentation, semantic segmentation, and referring expression comprehension using a single unified architecture. It achieves competitive performance across 160 datasets by leveraging vision-language transformer foundations with thousands of vocabulary entries and language descriptions for flexible perception in varied scenarios.