fixie-ai/ultravox
A multimodal LLM that extends open-weight models (Llama, Mistral, Gemma) with a projector enabling direct audio understanding for real-time voice AI.

Ultravox is a speech-capable multimodal LLM that processes audio directly, without a separate ASR stage: a projector converts audio into the same high-dimensional embedding space used by the underlying language model. Because no intermediate transcription step is needed, it can respond faster than cascaded ASR + LLM systems. The model builds on research from AudioLM, SeamlessM4T, and similar works, and versions have been trained on Llama 3, Mistral, and Gemma architectures.
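
To make the projector idea concrete, here is a minimal pure-Python sketch of mapping audio encoder frames into an LLM's embedding space and concatenating them with text embeddings. It is an illustration under stated assumptions, not Ultravox's actual implementation: the function names (`stack_frames`, `linear`, `project_audio`), the toy dimensions, the frame-stacking step, and the single linear layer are all hypothetical choices made for this example, and the weights are random rather than trained.

```python
import random


def stack_frames(frames, factor):
    """Concatenate every `factor` consecutive frames to shorten the sequence.

    Trailing frames that do not fill a complete group are dropped.
    """
    usable = len(frames) - len(frames) % factor
    return [
        [x for frame in frames[i:i + factor] for x in frame]
        for i in range(0, usable, factor)
    ]


def linear(vectors, weights):
    """Apply a weight matrix (out_dim rows, in_dim columns) to each vector."""
    return [
        [sum(w * x for w, x in zip(row, vec)) for row in weights]
        for vec in vectors
    ]


def project_audio(frames, weights, factor):
    """Hypothetical projector: stack frames, then map them to the LLM width."""
    return linear(stack_frames(frames, factor), weights)


# Toy sizes chosen for readability (assumptions, not real model dimensions).
AUDIO_DIM, LLM_DIM, STACK = 4, 6, 2

rng = random.Random(0)

# Stand-in for the output of an audio encoder: 10 frames of AUDIO_DIM features.
audio_frames = [[rng.uniform(-1, 1) for _ in range(AUDIO_DIM)] for _ in range(10)]

# Random projector weights; in a real system these would be learned.
weights = [
    [rng.uniform(-0.1, 0.1) for _ in range(AUDIO_DIM * STACK)]
    for _ in range(LLM_DIM)
]

audio_embeddings = project_audio(audio_frames, weights, STACK)

# Stand-in for three embedded prompt tokens from the LLM's own embedding table.
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]

# The language model would consume one sequence mixing text and audio positions.
llm_input = text_embeddings + audio_embeddings

print(len(llm_input), len(llm_input[0]))
```

The point of the sketch is that the language model receives audio as vectors of the same width as its token embeddings, so no transcript has to be produced before generation begins.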