
TIGER-AI-Lab/VLM2Vec

A vision-language model trained for unified multimodal embedding across images, videos, and visual documents.

Velocity (7d): +1.1 ★ / day · Trend: steady

VLM2Vec-V2 is a unified framework for learning multimodal embeddings across diverse visual formats. It trains vision-language models using contrastive learning to produce representations for image retrieval, video retrieval, and visual document understanding. The project includes MMEB-V2, a comprehensive benchmark with 78 tasks for evaluating embedding models across these modalities.
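As a rough illustration of what contrastive training optimizes, here is a minimal sketch of an InfoNCE-style loss over paired embeddings. The listing above does not say which loss VLM2Vec-V2 uses, so the InfoNCE form, the `temperature` value, and the function names are all assumptions for illustration, not the project's code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(queries, targets, temperature=0.05):
    """Mean InfoNCE loss where targets[i] is the positive for queries[i].

    Every other target in the batch acts as a negative for that query.
    The temperature value is an illustrative assumption.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity between query i and every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in logits))
        # Negative log-probability assigned to the matching target.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy 2-D "embeddings": each query should match the target at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each query is closest to its own target and large when the pairs are mismatched, which is the signal that pulls matching query and target embeddings together during training.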
