deepglint/unicom
UNICOM is a large-scale vision transformer model designed as a visual backbone for multimodal large language models like LLaVA.

Velocity (7d): +0.6 ★/day · Trend: steady
The repository provides foundational visual representation models trained at scale on the LAION-400M and COYO-700M datasets. It uses sample-to-cluster contrastive learning to train the vision encoders, which then serve as the vision tower in multimodal LLM pipelines such as LLaVA-NeXT with Qwen2.5-7B. The reported benchmarks show strong performance on document understanding, chart analysis, and general VQA tasks.
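To make the idea of sample-to-cluster contrast concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: the function names, the temperature value, and the assumption that each sample already has a cluster id (for example from an offline clustering step) are all illustrative. Each sample embedding is scored against every cluster center by cosine similarity, and a softmax cross-entropy pulls it toward its assigned center and away from the others.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def sample_to_cluster_loss(embeddings, centers, cluster_ids, temperature=0.1):
    """Mean cross-entropy between each sample and its assigned cluster center.

    Hypothetical helper for illustration only:
    embeddings  -- list of sample vectors (e.g. vision encoder outputs)
    centers     -- list of cluster center vectors
    cluster_ids -- assigned center index for each sample (assumed precomputed)
    temperature -- softmax temperature (assumed value, not from the source)
    """
    unit_centers = [l2_normalize(c) for c in centers]
    total = 0.0
    for emb, target in zip(embeddings, cluster_ids):
        unit_emb = l2_normalize(emb)
        # Cosine similarity to every center, sharpened by the temperature.
        logits = [
            sum(e * c for e, c in zip(unit_emb, center)) / temperature
            for center in unit_centers
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the assigned cluster.
        total += log_sum_exp - logits[target]
    return total / len(embeddings)


centers = [[1.0, 0.0], [0.0, 1.0]]
embeddings = [[0.9, 0.1], [0.2, 0.8]]

matched_loss = sample_to_cluster_loss(embeddings, centers, [0, 1])
mismatched_loss = sample_to_cluster_loss(embeddings, centers, [1, 0])
print(matched_loss < mismatched_loss)
```

In this toy setup the loss is lower when each sample is assigned to the center it points toward, which is the signal a sample-to-cluster objective uses to shape the encoder. A real training setup would use far more clusters and typically computes the loss on batches with a tensor library.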