facebookresearch/sam-audio
A foundation model from Meta for isolating arbitrary sounds in audio mixtures using natural language, visual, or temporal prompts.

Velocity (7d): +13 ★/day · Trend: steady
SAM-Audio is a multimodal audio processing model that separates specific sounds from complex audio mixtures based on prompt inputs. It leverages a Perception-Encoder Audio-Visual (PE-AV) backbone to enable cross-modal understanding. Users can query audio by describing desired sounds in text, providing visual cues from video, or specifying time spans for extraction.
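
The three prompt types described above can be pictured as a small data structure. The following is a minimal sketch only: `SeparationPrompt`, `modalities`, and `spans_to_frame_mask` are hypothetical names invented for illustration and are not part of the SAM-Audio codebase, whose actual interface is not shown here. It assumes a temporal prompt is a list of `(start, end)` pairs in seconds and shows one plausible way to turn such spans into a per-frame mask.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SeparationPrompt:
    """Hypothetical container for the three prompt types (not the SAM-Audio API)."""
    text: Optional[str] = None          # natural-language description, e.g. "dog barking"
    video_path: Optional[str] = None    # video file supplying visual cues
    spans: List[Tuple[float, float]] = field(default_factory=list)  # (start_s, end_s)

    def modalities(self) -> List[str]:
        """Return the names of the prompt types that are actually set."""
        active = []
        if self.text:
            active.append("text")
        if self.video_path:
            active.append("visual")
        if self.spans:
            active.append("temporal")
        return active


def spans_to_frame_mask(spans, duration_s, frames_per_second):
    """Build a boolean mask with one entry per frame.

    A frame i is marked True when floor(start * fps) <= i < floor(end * fps)
    for at least one span. Spans are clipped to the clip duration.
    """
    n_frames = int(round(duration_s * frames_per_second))
    mask = [False] * n_frames
    for start_s, end_s in spans:
        if end_s <= start_s:
            raise ValueError(f"span end must exceed start: ({start_s}, {end_s})")
        first = max(0, int(start_s * frames_per_second))
        last = min(n_frames, int(end_s * frames_per_second))
        for i in range(first, last):
            mask[i] = True
    return mask


# Usage: a text prompt combined with a one-second temporal span.
prompt = SeparationPrompt(text="dog barking", spans=[(1.0, 2.0)])
mask = spans_to_frame_mask(prompt.spans, duration_s=4.0, frames_per_second=10)
```

Keeping all three fields optional reflects the description's point that a query may use text, visual, or temporal cues; how the real model combines or encodes them is not specified in this summary.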