toverainc/willow-inference-server
An open-source inference server for running Whisper-based ASR, TTS, and LLMs locally, with WebRTC support.

Willow Inference Server is a self-hosted language inference server that provides speech recognition (ASR) with Whisper, speech synthesis (TTS), and LLMs (LLaMA, Vicuna). It uses CTranslate2 for optimized Whisper inference and supports multiple transports: WebRTC for real-time streaming, REST, and WebSockets. The server is memory-optimized so that multiple Whisper models and TTS can be loaded simultaneously within 6 GB of VRAM, and it targets CUDA GPUs ranging from consumer to datacenter cards.
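
As a rough illustration of the REST transport, the sketch below builds (but does not send) HTTP requests for speech synthesis and speech recognition using only the Python standard library. The host name, port, the `/api/tts` and `/api/asr` paths, the `text` query parameter, and the `audio/wav` content type are assumptions made for this example and are not taken from the description above; check the project's own API documentation for the actual routes.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; replace with the address of your own server.
BASE_URL = "https://wis.example.local:19000"


def build_tts_request(base_url: str, text: str) -> Request:
    """Build a GET request asking the server to synthesize `text`.

    The "/api/tts" path and "text" parameter are assumed for illustration.
    """
    query = urlencode({"text": text})
    return Request(f"{base_url.rstrip('/')}/api/tts?{query}", method="GET")


def build_asr_request(base_url: str, wav_bytes: bytes) -> Request:
    """Build a POST request uploading WAV audio for transcription.

    The "/api/asr" path and content type are assumed for illustration.
    """
    return Request(
        f"{base_url.rstrip('/')}/api/asr",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


tts_req = build_tts_request(BASE_URL, "Hello world")
asr_req = build_asr_request(BASE_URL, b"RIFF....WAVE")  # placeholder bytes, not real audio
print(tts_req.get_method(), tts_req.full_url)
print(asr_req.get_method(), asr_req.full_url)
```

Passing either object to `urllib.request.urlopen` would send it. Building the request separately from sending it keeps the sketch runnable without a live server.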