
NVlabs/GroupViT

GroupViT is a transformer-based vision model that learns semantic segmentation from image-caption pairs without requiring mask annotations.

Velocity (7d): +0.5 ★ / day · Trend: steady

The model uses a hierarchical grouping mechanism that progressively merges visual tokens into semantically coherent regions. It learns text-image alignment through contrastive learning on image-caption datasets, which enables zero-shot segmentation of unseen object classes. Because supervision comes from captions alone, no pixel-level mask annotations are needed.
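To make the two ideas above concrete, here is a minimal pure-Python sketch of one grouping step and a symmetric image-text contrastive loss. It is not the repository's code: the function names `group_tokens` and `contrastive_loss`, the hard argmax assignment, the mean pooling of group members, and the temperature of 0.07 are all illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def group_tokens(tokens, group_queries):
    """Hypothetical grouping step: assign each token to its most similar
    group query (hard argmax, an assumption), then average each group."""
    assignments = []
    for t in tokens:
        sims = [dot(normalize(t), normalize(g)) for g in group_queries]
        assignments.append(max(range(len(sims)), key=sims.__getitem__))

    dim = len(tokens[0])
    merged = []
    for g_idx, g in enumerate(group_queries):
        members = [t for t, a in zip(tokens, assignments) if a == g_idx]
        if members:
            merged.append(
                [sum(m[d] for m in members) / len(members) for d in range(dim)]
            )
        else:
            merged.append(list(g))  # empty group: fall back to the query
    return merged, assignments


def log_softmax_row(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: the i-th image and i-th caption are the
    positive pair; every other pairing in the batch is a negative."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    image_to_text = -sum(log_softmax_row(logits[k])[k] for k in range(n)) / n
    text_to_image = -sum(log_softmax_row(logits_t[k])[k] for k in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Four toy 2-D tokens coalesce into two groups.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
queries = [[1.0, 0.0], [0.0, 1.0]]
merged, assignments = group_tokens(tokens, queries)
print(assignments)  # → [0, 0, 1, 1]
```

Stacking several such steps with fewer queries at each stage gives the hierarchical flavor: many patch tokens shrink to a handful of region embeddings, and the pooled image embedding is what the contrastive loss aligns with the caption embedding. A trainable model would need a differentiable assignment in place of the plain argmax used here.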
