Self-hosted voice agents that can see, remember, and lip-sync
CyberVerse wires WebRTC, RAG, and optional real-time avatar video into a modular stack for building persistent AI companions.

What it does
CyberVerse is a self-hosted framework for real-time, voice-first AI agents. It handles low-latency conversation over WebRTC (P2P or LiveKit SFU), persists character memory to disk, supports RAG over imported documents, and can optionally generate real-time digital-human video with lip-sync from a single reference photo. The stack runs as three services: a Python inference server, a Go API server, and a frontend.
The interesting bit
The architecture splits “foreground” conversation flow from “background” work. A PersonaAgent keeps voice turns responsive and interruptible, while SubAgents handle slow tasks like research or report generation asynchronously. This prevents the awkward pause-while-thinking problem that kills immersion in voice interfaces.
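The foreground/background split can be illustrated with a minimal asyncio sketch. Only the names PersonaAgent and SubAgent come from the project description; the methods (`delegate`, `handle_turn`, `run`), the delay value, and the reply strings are hypothetical and are not CyberVerse's actual API.

```python
import asyncio


class SubAgent:
    """Hypothetical background worker for slow tasks (research, reports)."""

    def __init__(self, name: str, delay: float) -> None:
        self.name = name
        self.delay = delay  # stands in for a slow model or tool call

    async def run(self, task: str) -> str:
        await asyncio.sleep(self.delay)
        return f"{self.name} finished: {task}"


class PersonaAgent:
    """Hypothetical foreground agent that answers voice turns immediately."""

    def __init__(self) -> None:
        self.pending: set = set()   # keeps strong references to running jobs
        self.results: list = []     # finished background results

    def delegate(self, worker: SubAgent, task: str) -> None:
        # Schedule the slow work without awaiting it, so the turn loop stays free.
        job = asyncio.create_task(worker.run(task))
        self.pending.add(job)
        job.add_done_callback(self._collect)

    def _collect(self, job: asyncio.Task) -> None:
        self.pending.discard(job)
        if not job.cancelled() and job.exception() is None:
            self.results.append(job.result())

    async def handle_turn(self, utterance: str) -> str:
        # A real agent would call a streaming LLM/TTS here; this returns at once.
        return f"reply to: {utterance}"


async def demo():
    persona = PersonaAgent()
    persona.delegate(SubAgent("researcher", 0.05), "summarize imported docs")
    replies = [await persona.handle_turn(u) for u in ("hello", "how are you?")]
    pending_during_turns = len(persona.pending)  # background job still running
    await asyncio.gather(*persona.pending)       # wait for background work
    return replies, pending_during_turns, persona.results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Both replies are produced while the background job is still pending; its result only shows up in `results` once the job completes, which is the behavior the split is meant to provide.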
Key highlights
- Voice mode works without any local GPU; flip `inference.avatar.enabled` to `false` and it streams audio only
- Supports visual input from user camera or screen share in standard/omni sessions
- Modular “brain, voice, hearing, tools, memory, face” stack swappable via YAML config and web UI at `/settings`
- Currently wires in Alibaba Qwen or Volcengine Doubao models and voice APIs
- Avatar backends: FlashHead (1.3B weights) or LiveAct, with vllm support for the latter
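To make the avatar toggle concrete, here is a small Python sketch of how a dotted key such as `inference.avatar.enabled` might be resolved against a parsed config. The key name comes from the project description; the helper functions, the mode labels, and the default for a missing key are assumptions made for illustration, and the dict stands in for whatever the real YAML file contains.

```python
from typing import Any


def get_path(config: dict, dotted: str, default: Any = None) -> Any:
    """Walk a nested dict using a dotted key such as 'inference.avatar.enabled'."""
    node: Any = config
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def session_mode(config: dict) -> str:
    # Treating a missing key as "avatar off" is an assumption of this sketch.
    enabled = get_path(config, "inference.avatar.enabled", False)
    return "audio+video" if enabled else "audio-only"


# Stand-in for the parsed YAML; only the avatar flag is taken from the source.
config = {"inference": {"avatar": {"enabled": False}}}
print(session_mode(config))  # prints "audio-only"
```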
Caveats
- Setup is involved: Node 18+, Go 1.25, Conda, Python 3.10, FFmpeg, plus API keys for Chinese cloud providers (DashScope or Doubao)
- Avatar mode needs CUDA 12.8, PyTorch 2.8, and manual model weight downloads from Hugging Face or ModelScope
- README demos are example characters, not bundled with the project
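Given the number of prerequisites, a quick preflight check can save time before installing anything. The sketch below is not part of CyberVerse; the executable names are assumptions about how each tool appears on PATH, and it checks presence only, not the versions listed above.

```python
import shutil
from typing import Callable, Optional

# Executable names are assumptions about how each prerequisite appears on PATH.
REQUIRED_TOOLS = ("node", "go", "conda", "ffmpeg")


def missing_tools(
    tools: tuple = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All prerequisite executables found (versions not checked).")
```

The `which` parameter is injectable so the check can be exercised without the real tools installed.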
Verdict
Worth a look if you’re building persistent voice companions or digital-human interfaces and want full control over the pipeline. Skip it if you need a one-click SaaS, or if you want primarily English-centric TTS/ASR and have no interest in Chinese model ecosystems.