tue-mps/eomt

Encoder-only Mask Transformer (EoMT) repurposes a plain Vision Transformer for joint image patch and segmentation query encoding, achieving competitive segmentation accuracy without task-specific decoder components.

593 stars · Jupyter Notebook · Computer Vision
eomt
Velocity (7d): +1.1 ★ / day
Trend: steady

EoMT is a minimalist image segmentation model that converts a standard Vision Transformer into a unified architecture for image and video segmentation tasks. The model encodes both image patches and segmentation queries as tokens within the plain ViT, eliminating the need for adapters or task-specific decoders. By leveraging large-scale pretrained ViTs, EoMT achieves accuracy comparable to state-of-the-art methods while running significantly faster: up to 4× faster with ViT-L than more complex approaches.
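
To make the "patches and queries as tokens in one encoder" idea concrete, here is a minimal toy sketch in plain Python. It is not the repository's code: the function names (`self_attention`, `encode_and_predict`), the identity attention projections, the random token values, and the dot-product mask scoring are all assumptions chosen for illustration. It only shows the data flow, where query tokens are appended to the patch tokens, the joint sequence passes through shared self-attention layers, and the sequence is then split again to score each query against each patch.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Single-head self-attention with identity projections (toy simplification).

    Every token attends to every other token, so patch and query tokens
    exchange information inside the same layer.
    """
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / scale for k in tokens])
        mixed = [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(len(q))]
        out.append([a + b for a, b in zip(q, mixed)])  # residual connection
    return out


def encode_and_predict(patch_tokens, query_tokens, num_layers=2):
    """Hypothetical helper: jointly encode patches and queries, return mask logits."""
    num_patches = len(patch_tokens)
    tokens = patch_tokens + query_tokens  # one joint token sequence
    for _ in range(num_layers):
        tokens = self_attention(tokens)
    patches, queries = tokens[:num_patches], tokens[num_patches:]
    # One score per (query, patch) pair; assumed dot-product scoring.
    return [[dot(q, p) for p in patches] for q in queries]


rng = random.Random(0)
dim, num_patches, num_queries = 8, 16, 3
patch_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
query_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

mask_logits = encode_and_predict(patch_tokens, query_tokens)
print(len(mask_logits), len(mask_logits[0]))  # 3 16
```

The output has one row per query and one column per patch, so each row can be read as a coarse mask over the patch grid. No separate decoder module appears anywhere in the sketch; the only components are the shared attention layers and a final scoring step.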