hustvl/EVF-SAM
A multimodal model that segments objects in images based on text prompts by fusing vision and language representations early.

This repository implements EVF-SAM, which extends the Segment Anything Model (SAM) with early vision-language fusion for text-prompted referring image segmentation. The model takes an image together with a text description and outputs a segmentation mask for the described region. It builds on both the original SAM and SAM-2 architectures, adding textual grounding so that users can specify what to segment in natural language rather than with visual prompts such as points or boxes.
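
To make the idea of early fusion concrete, the toy sketch below puts image patch tokens and text tokens into one sequence, runs a single attention pass over the joint sequence, and uses the fused text tokens as a prompt for a mask. It is not the repository's code: the function names (`joint_self_attention`, `fused_prompt`, `predict_mask`), the 2-dimensional embeddings, the unparameterized attention, and the dot-product "decoder" are all assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(tokens):
    """One attention pass with no learned weights (a simplification)."""
    dim = len(tokens[0])
    fused = []
    for query in tokens:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in tokens])
        fused.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
        )
    return fused


def fused_prompt(image_tokens, text_tokens):
    """Early fusion: both modalities share ONE sequence before attention."""
    fused = joint_self_attention(image_tokens + text_tokens)
    fused_text = fused[len(image_tokens):]  # keep only the text positions
    dim = len(fused_text[0])
    # Mean-pool the fused text tokens into a single prompt embedding.
    return [sum(tok[i] for tok in fused_text) / len(fused_text) for i in range(dim)]


def predict_mask(image_tokens, text_tokens):
    """Hypothetical stand-in for a mask decoder: a patch is foreground
    when its embedding has a positive dot product with the prompt."""
    prompt = fused_prompt(image_tokens, text_tokens)
    return [dot(patch, prompt) > 0 for patch in image_tokens]


# Four made-up patch embeddings: two of one object, two of another.
patches = [[1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, 1.0]]
# Two made-up text-token embeddings that point toward the second object.
text = [[-2.0, 2.0], [-1.0, 1.0]]

mask = predict_mask(patches, text)
print(mask)  # only the last two patches are selected
```

Swapping the text tokens for ones that point the other way (for example `[[2.0, -2.0]]`) selects the first two patches instead. This mirrors, in miniature, how a change of text prompt changes the segmented region without any point or box input.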