TIGER-AI-Lab/VLM2Vec
A vision-language model trained for unified multimodal embedding across images, videos, and visual documents.

Velocity (7d): +1.1 ★/day · Trend: steady
VLM2Vec-V2 is a unified framework for learning multimodal embeddings across images, videos, and visual documents. It trains vision-language models with contrastive learning to produce representations for image retrieval, video retrieval, and visual document understanding. The project also includes MMEB-V2, a 78-task benchmark for evaluating embedding models across these modalities.
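
To make the contrastive-learning idea concrete, here is a minimal, dependency-free sketch of an InfoNCE-style loss with in-batch negatives. It is an illustration only, not the project's training code: the description above says only that contrastive learning is used, so the choice of InfoNCE, the function names `l2_normalize` and `info_nce_loss`, and the `temperature=0.05` default are all assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(queries, targets, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper, not the repo's API).

    targets[i] is treated as the positive for queries[i]; every other
    target in the batch serves as a negative. The temperature value is
    an assumed default, not one taken from the project.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every target.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching target as the correct class.
        total += log_denom - logits[i]
    return total / len(q)


# Toy 2-D "embeddings": in practice these would come from the model.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is lower when each query sits next to its paired target than when the pairs are swapped, which is the signal a contrastive objective uses to pull matching query/target embeddings together and push non-matching ones apart.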