NVlabs/GroupViT
GroupViT is a transformer-based vision model that learns semantic segmentation from image-caption pairs without requiring mask annotations.

The model uses a hierarchical grouping mechanism that progressively merges visual tokens into larger, semantically coherent segments. It is trained with a contrastive image-text objective on image-caption datasets, which enables zero-shot segmentation of unseen object classes. Because supervision comes only from captions, no pixel-level mask annotations are needed, which avoids the cost of labeling segmentation masks.
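
To make the three ideas in this description more concrete (merging tokens into groups, contrastive image-text alignment, and zero-shot labeling), here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the repository's implementation: the function names `group_tokens`, `contrastive_loss`, and `zero_shot_labels` are hypothetical, the temperature value is an arbitrary choice, and a plain nearest-center hard assignment stands in for the model's learned grouping mechanism.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def group_tokens(tokens, group_centers):
    """Merge N tokens into G groups (one grouping stage, simplified).

    Assumption: each token is hard-assigned to its most similar group center,
    and each group embedding is the mean of its assigned tokens.
    tokens: (N, D), group_centers: (G, D) -> merged (G, D), assign (N,)
    """
    sim = l2_normalize(tokens) @ l2_normalize(group_centers).T  # (N, G)
    assign = sim.argmax(axis=1)                                 # (N,)
    one_hot = np.eye(group_centers.shape[0])[assign]            # (N, G)
    counts = one_hot.sum(axis=0)[:, None]                       # (G, 1)
    merged = (one_hot.T @ tokens) / np.maximum(counts, 1.0)     # empty groups stay zero
    return merged, assign


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of matched image/text pairs.

    Row i of image_emb is assumed to match row i of text_emb.
    The temperature is a hypothetical default, not a documented setting.
    """
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def zero_shot_labels(group_emb, assign, class_text_emb):
    """Give each token the class whose text embedding best matches its group."""
    sim = l2_normalize(group_emb) @ l2_normalize(class_text_emb).T  # (G, C)
    group_class = sim.argmax(axis=1)                                # (G,)
    return group_class[assign]                                      # (N,)
```

In this sketch, a multi-stage hierarchy would correspond to calling `group_tokens` repeatedly with fewer centers each time, and `zero_shot_labels` would be applied to the final groups using text embeddings of class names that need not have appeared as segmentation labels during training.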