collabora/WhisperFusion
A real-time conversational AI system combining Whisper speech-to-text with the Mistral LLM, optimized with TensorRT for low-latency voice interactions.

WhisperFusion is a real-time speech-to-text pipeline that feeds transcribed audio directly into the Mistral Large Language Model, enabling seamless voice conversations with AI. The system uses Nvidia TensorRT-LLM to optimize both the Whisper model and the LLM for inference, and leverages torch.compile on WhisperSpeech for additional performance gains. It requires a GPU with at least 24GB RAM, typically an RTX 4090, to achieve the desired low-latency real-time performance.