← all repositories

baaivision/tokenize-anything

A promptable vision foundation model for segmenting, recognizing, and captioning arbitrary regions with flexible visual prompts.

602 stars Jupyter Notebook Image · Video · AudioLanguage Models
tokenize-anything
Velocity · 7d
+0.7
★ / day
Trend
steady
star history

Tokenize Anything via Prompting (TAP) is a unified model that simultaneously performs open-world segmentation, recognition, and captioning using visual prompts such as points, boxes, and sketches. The model is trained on exhaustive segmentation masks from SA-1B combined with semantic priors from EVA-CLIP, a 5-billion parameter vision-language model. It provides a modular design with decoupled components and predictors for flexible integration.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.