Browser avatars that actually move their lips on time
A JavaScript class for real-time lip-sync with full-body 3D avatars, built on Three.js and used in everything from AI dating profiles to Twitch adventures.

What it does
TalkingHead is a browser-based JavaScript class that drops a 3D avatar into a webpage and makes it speak with real-time lip-sync. It supports full-body GLB avatars and Mixamo FBX animations, and it can translate emojis into facial expressions. The rendering is plain Three.js/WebGL: no magic, just geometry moving on cue.
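A minimal sketch of what typical usage might look like. The option and method names (`ttsEndpoint`, `lipsyncModules`, `cameraView`, `showAvatar`, `speakText`) are written from memory of the project's README and should be checked against the current documentation. The `TalkingHead` class is passed in as a parameter so the sketch does not assume a module path; `startAvatar`, the `/gtts/` proxy path, and the avatar URL are hypothetical.

```javascript
// Sketch only: option names are recalled from the README and may differ
// between versions. "/gtts/" is a hypothetical server-side TTS proxy.
async function startAvatar(TalkingHead, container, avatarUrl) {
  const head = new TalkingHead(container, {
    ttsEndpoint: "/gtts/",        // proxy that attaches the API key server-side
    lipsyncModules: ["en"],       // load only the English lip-sync module
    cameraView: "upper"           // frame the upper body
  });

  await head.showAvatar({
    url: avatarUrl,               // GLB avatar with a compatible rig and blend shapes
    body: "F",
    ttsLang: "en-GB",             // set explicitly rather than relying on the default
    ttsVoice: "en-GB-Standard-A",
    lipsyncLang: "en"
  });

  head.speakText("Hello, I am ready.");
  return head;
}
```

In a page, this would be called with the imported class and a container element, for example `startAvatar(TalkingHead, document.getElementById("avatar"), "./avatar.glb")`.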
The interesting bit
The project has accumulated a genuinely weird and impressive portfolio of real-world uses: MIT/Harvard dating-profile digital twins, a Cannes-featured Twitch game, quantum physics lectures, and cancer clinical trial recruitment. The author seems mildly surprised by this. The lip-sync engine is modular: it ships with five built-in languages (English, German, French, Finnish, Lithuanian), and you can plug in Microsoft Azure for 100+ languages or bypass text entirely with the companion HeadAudio module, which drives visemes directly from audio.
Key highlights
- Real-time lip-sync from TTS word-level timestamps or direct viseme/blend-shape data
- Uses Google Cloud TTS by default; also works with ElevenLabs and Azure, and with in-browser Kokoro through the HeadTTS add-on
- Companion modules: HeadTTS (free neural TTS with WebGPU), HeadAudio (audio-driven lip-sync without transcription), MotionEngine (LLM-driven gestures)
- Dynamic bones and built-in physics for avatars with rigged hair or clothing
- Minimal hobbyist example: single HTML file, add your Google API key, done
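To make the first highlight concrete, here is a toy sketch of how word-level timestamps could be turned into a viseme timeline. This is not TalkingHead's actual algorithm: `wordsToVisemes` and the letter table are hypothetical, the mapping is deliberately crude, and real lip-sync modules apply language-specific rules. Each word's duration is simply split evenly among the visemes found in it.

```javascript
// Toy letter-to-viseme table using Oculus-style viseme names.
// Illustrative only; real modules use language-specific pronunciation rules.
const LETTER_TO_VISEME = {
  a: "aa", e: "E", i: "I", o: "O", u: "U",
  p: "PP", b: "PP", m: "PP",
  f: "FF", v: "FF",
  t: "DD", d: "DD",
  k: "kk", g: "kk",
  s: "SS", z: "SS",
  n: "nn", l: "nn",
  r: "RR"
};

// Convert word-level timings (milliseconds) into a flat viseme timeline.
// Letters without a table entry are skipped.
function wordsToVisemes(words) {
  const timeline = [];
  for (const { text, start, duration } of words) {
    const visemes = [...text.toLowerCase()]
      .map((ch) => LETTER_TO_VISEME[ch])
      .filter(Boolean);
    if (visemes.length === 0) continue;
    const step = duration / visemes.length;
    visemes.forEach((viseme, i) => {
      timeline.push({ viseme, time: start + i * step, duration: step });
    });
  }
  return timeline;
}

// Hypothetical TTS output: two words with start times and durations.
const timeline = wordsToVisemes([
  { text: "map", start: 0, duration: 300 },
  { text: "so", start: 400, duration: 200 }
]);
console.log(timeline);
```

A renderer would then walk this timeline against the audio clock and raise the matching blend shape at each `time`.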
Caveats
- Avatars need a Mixamo-compatible rig plus ARKit and Oculus viseme blend shapes — not a drop-in-any-model situation
- The README warns against putting Google TTS API keys in client-side code, then immediately offers a minimal example that does exactly that; production use requires JWT/proxy setup
- Default language is Finnish ("fi-FI"), which is charming but may confuse first-time users
Verdict
Grab this if you’re building browser-based AI interfaces, virtual presenters, or interactive characters and need lip-sync without Unity/Unreal overhead. Skip it if you want plug-and-play with arbitrary 3D models or need native mobile performance.
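One quick way to tell whether a given model falls on the "skip" side is to check its morph target names before committing to it. The sketch below is an assumption-laden helper, not part of the library: `missingBlendShapes` is hypothetical, the Oculus viseme names use the common `viseme_` prefix, and only a small sample of the ARKit blend shapes is checked rather than the full set.

```javascript
// The 15 Oculus visemes, using the common "viseme_" naming convention.
const OCULUS_VISEMES = [
  "sil", "PP", "FF", "TH", "DD", "kk", "CH", "SS",
  "nn", "RR", "aa", "E", "I", "O", "U"
].map((v) => "viseme_" + v);

// A small sample of ARKit blend shape names, not the complete list.
const ARKIT_SAMPLE = [
  "jawOpen", "eyeBlinkLeft", "eyeBlinkRight",
  "mouthSmileLeft", "mouthSmileRight", "browInnerUp"
];

// Return the expected blend shape names that the model does not provide.
function missingBlendShapes(morphTargetNames) {
  const present = new Set(morphTargetNames);
  return [...OCULUS_VISEMES, ...ARKIT_SAMPLE].filter((n) => !present.has(n));
}

// Example: a model that only has a jaw control is missing almost everything.
console.log(missingBlendShapes(["jawOpen"]));
```

With Three.js, the names for a loaded mesh can be read from `Object.keys(mesh.morphTargetDictionary)`; an empty result from this check is a good sign but does not confirm that the rig itself is compatible.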