Is GroupViT open source?

Yes — NVlabs/GroupViT is an open-source project tracked on heatdrop.

What language is GroupViT written in?

NVlabs/GroupViT is primarily written in Python.

How popular is GroupViT?

NVlabs/GroupViT has 788 stars on GitHub.

Where can I find GroupViT?

NVlabs/GroupViT is on GitHub at https://github.com/NVlabs/GroupViT.

NVlabs/GroupViT

Pixel grouping from web text, no masks required

A vision transformer that learns semantic segmentation purely from image captions, never seeing a human-drawn pixel mask during training.

★ 788 stars | Python | Computer Vision | Image · Video · Audio

What it does

GroupViT is a research implementation of a hierarchical vision transformer that learns semantic segmentation from noisy web-scale image-text pairs. It performs bottom-up grouping of visual regions based on semantic similarity, then transfers zero-shot to standard segmentation benchmarks like Pascal VOC and COCO. The repo provides training scripts, pre-trained checkpoints, and Gradio demos, but the bulk of the README is devoted to wrangling massive datasets into webdataset shards.
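
The zero-shot transfer step can be pictured as a nearest-text lookup: each learned group is matched to the class name whose text embedding is closest, and every patch inherits its group's class. The following is a minimal, dependency-free sketch under that assumption, not the repository's code. The function names and the toy embeddings are hypothetical, and a real pipeline would get both embeddings from the trained image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def zero_shot_labels(group_emb, text_emb, patch_to_group):
    """Label every patch with the class that best matches its group.

    group_emb:      G vectors, one per group (hypothetical encoder output)
    text_emb:       C vectors, one per class-name prompt (hypothetical)
    patch_to_group: 2D grid of group indices, one per patch
    """
    # For each group, pick the class whose text embedding is most similar.
    group_to_class = [
        max(range(len(text_emb)), key=lambda c: cosine(g, text_emb[c]))
        for g in group_emb
    ]
    # Every patch inherits the class of the group it belongs to.
    return [[group_to_class[g] for g in row] for row in patch_to_group]
```

Because the class list only enters through `text_emb`, swapping benchmarks in this sketch means swapping the prompts, with no retraining.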

The interesting bit

Instead of relying on costly pixel-accurate mask annotations, the model figures out spatial grouping through hierarchical attention, essentially bootstrapping segmentation structure from weak text supervision. It is a neat trick: the captions provide the semantic labels, and the architecture learns to cluster pixels accordingly.
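
One grouping stage can be approximated as follows: score every patch token against a small set of group vectors, assign each token to its best match, and pool the members of each group into one merged token. This is a simplified sketch, not the repository's implementation. It uses a plain argmax and mean pooling with fixed vectors, whereas the trained model learns its group vectors and needs an assignment it can backpropagate through. `group_tokens` and its arguments are hypothetical names.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def group_tokens(tokens, queries):
    """Assign each token to its best-matching group query, then mean-pool.

    tokens:  N feature vectors (hypothetical patch features)
    queries: G group vectors (learned in a real model, fixed here)
    Returns (assignment, merged): a group index per token and one pooled
    vector per group (all zeros for a group that received no tokens).
    """
    dim = len(tokens[0])
    # Hard assignment: each token goes to the query it scores highest with.
    assignment = [
        max(range(len(queries)), key=lambda g: dot(t, queries[g]))
        for t in tokens
    ]
    sums = [[0.0] * dim for _ in queries]
    counts = [0] * len(queries)
    for t, g in zip(tokens, assignment):
        counts[g] += 1
        for i, x in enumerate(t):
            sums[g][i] += x
    # Mean of the member tokens; max(n, 1) avoids dividing by zero.
    merged = [[s / max(n, 1) for s in row] for row, n in zip(sums, counts)]
    return assignment, merged
```

Calling the function again on `merged` with fewer queries gives a second, coarser level, and composing the two assignment lists maps each original patch to its final group.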

Key highlights

Zero-shot segmentation on Pascal VOC, Pascal Context, and COCO after pre-training only on image-text pairs (GCC, YFCC, RedCaps).
Hierarchical grouping mechanism built into the transformer layers, not a post-hoc add-on.
Pre-trained weights and interactive demos (Hugging Face Spaces, Colab) are provided, though weights live in a separate fork.
Built on a stack of webdataset, mmsegmentation, and timm for scalable training and evaluation.
Benchmark numbers are published for two training configurations, topping out at 52.3 mIoU on Pascal VOC for the GCC+YFCC model.
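
For readers unfamiliar with the metric, mIoU (mean intersection-over-union) averages, over classes, the overlap between predicted and ground-truth pixels divided by their union. It is usually quoted on a 0 to 100 scale, so 52.3 corresponds to 0.523. Below is a minimal sketch of the computation over flat label sequences. The `ignore_index` default of 255 is an assumed convention for unlabeled pixels, and the repository's own evaluation code may differ in details.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over classes seen in pred or target.

    pred, target: flat sequences of integer class labels, equal length;
                  pred values are assumed to lie in range(num_classes).
    ignore_index: target label skipped during scoring (assumed convention).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # A correct pixel counts once in both intersection and union.
            inter[p] += 1
            union[p] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")
```

Passing the concatenated labels of a whole validation set accumulates the counts across images before dividing, which is how dataset-level scores are commonly computed.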

Caveats

The dependency stack is frozen in time: Python 3.7, PyTorch 1.8, and specific legacy versions of mmcv/mmsegmentation, suggesting this is a research snapshot rather than an actively maintained library.
Reproducing training requires downloading and sharding multiple massive web datasets (GCC3M, YFCC14M, RedCaps12M), which is a heavy lift.
The README trails off mid-sentence in the multi-node training section, and pre-trained weights are hosted externally rather than in this repository.
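
To give a sense of what the sharding step involves: a webdataset shard is an ordinary tar archive in which files sharing a basename form one sample. The sketch below writes image-caption pairs into such an archive using only the standard library. The `.jpg`/`.txt` extensions, the key format, and the `write_shard` helper are assumptions for illustration. The repository's own conversion scripts and naming scheme may differ.

```python
import io
import tarfile

def write_shard(path, samples):
    """Write (key, image_bytes, caption) triples into one tar shard.

    Files that share a basename (for example 000000.jpg and 000000.txt)
    are read back as a single sample by webdataset-style loaders.
    """
    with tarfile.open(path, "w") as tar:
        for key, image_bytes, caption in samples:
            members = (("jpg", image_bytes), ("txt", caption.encode("utf-8")))
            for ext, payload in members:
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)  # tar headers need the size up front
                tar.addfile(info, io.BytesIO(payload))
```

At the scale of the datasets listed above, a loop like this would run over millions of pairs and emit many shards, which is where most of the preparation time goes.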

Verdict

Worth a look if you are researching weakly supervised or zero-shot segmentation and need a baseline to compare against. Skip it if you want a modern, plug-and-play segmentation model with up-to-date dependencies.
