baaivision/tokenize-anything
A promptable vision foundation model for segmenting, recognizing, and captioning arbitrary regions with flexible visual prompts.

Tokenize Anything via Prompting (TAP) is a unified model that simultaneously performs open-world segmentation, recognition, and captioning using visual prompts such as points, boxes, and sketches. The model is trained on exhaustive segmentation masks from SA-1B combined with semantic priors from EVA-CLIP, a 5-billion parameter vision-language model. It provides a modular design with decoupled components and predictors for flexible integration.