Henry-23/VideoChat
Real-time voice-interactive digital human system with customizable appearance and voice, voice cloning, and sub-3s latency.

VideoChat is a real-time interactive digital human demo supporting both end-to-end and cascade architectures. The end-to-end approach uses a multimodal LLM (GLM-4-Voice) for direct speech-to-speech generation, while the cascade approach chains ASR (FunASR), an LLM (Qwen), TTS (GPT-SoVITS/CosyVoice), and talking-head generation (MuseTalk) into a single pipeline. Users can customize the avatar's appearance and voice characteristics, including voice cloning.
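
The cascade architecture described above can be pictured as four stages called in sequence. The following is a minimal sketch of that data flow only, not the project's actual code: the `CascadePipeline` class, the `respond` method, and the stub stage functions are all hypothetical names introduced for illustration, and real stages (FunASR, Qwen, GPT-SoVITS/CosyVoice, MuseTalk) would stand in for the stubs behind whatever interfaces the repository defines.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical stage signatures, assumed for illustration only.
ASR = Callable[[bytes], str]            # user audio -> transcript
LLM = Callable[[str], str]              # transcript -> reply text
TTS = Callable[[str], bytes]            # reply text -> reply audio
TalkingHead = Callable[[bytes], bytes]  # reply audio -> video frames


@dataclass
class CascadePipeline:
    """Chains ASR -> LLM -> TTS -> talking-head generation."""
    asr: ASR
    llm: LLM
    tts: TTS
    talking_head: TalkingHead

    def respond(self, user_audio: bytes) -> Tuple[bytes, bytes]:
        transcript = self.asr(user_audio)
        reply_text = self.llm(transcript)
        reply_audio = self.tts(reply_text)
        video = self.talking_head(reply_audio)
        return reply_audio, video


# Stub stages so the sketch runs without any model installed.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"echo: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_talking_head(speech: bytes) -> bytes:
    return b"frames:" + speech


pipeline = CascadePipeline(fake_asr, fake_llm, fake_tts, fake_talking_head)
speech, video = pipeline.respond(b"hello")
```

Keeping each stage behind a plain callable is one way to let a component be swapped (for example, GPT-SoVITS for CosyVoice) without touching the rest of the chain. An end-to-end model such as GLM-4-Voice would replace the first three stages with a single speech-to-speech call.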