Google's TTS lab in a box, minus the official blessing
An unofficial Google project that wraps Festival and Merlin into a web UI so non-specialists can train synthetic voices on GCP.

What it does
Voice Builder is a browser-based wrapper around two classic TTS engines, Festival and Merlin, that runs on Google Cloud Platform. You upload audio and text data, click a button, and wait 30–60 minutes for a deployable voice model you can test with a “hello” and a play button. The whole stack (Docker, Firebase, App Engine, Cloud Functions, Genomics Pipeline API) deploys via shell scripts.
The interesting bit
The project treats voice building as a batch job rather than a research artifact. It abstracts away the usual Festival/Merlin incantations by standardizing inputs through a JSON VoiceBuildingSpecification—lexicon paths, phonology, wavs, engine params—then hands that spec to either the built-in engines or a custom “data exporter” you can hook in to munge files first. That makes it feasible for linguists or language-preservation groups to iterate without becoming speech-processing hackers.
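To make the idea of a standardized spec concrete, here is a minimal sketch in Python of what assembling and sanity-checking such a JSON document could look like. The field names (`lexicon_path`, `phonology_path`, `wavs_path`, `engine`, `engine_params`), the bucket layout, and the `sample_rate` parameter are assumptions made for illustration; the real VoiceBuildingSpecification schema may differ.

```python
import json

# Hypothetical field names: the real VoiceBuildingSpecification schema may differ.
REQUIRED_KEYS = ("lexicon_path", "phonology_path", "wavs_path", "engine", "engine_params")
SUPPORTED_ENGINES = ("festival", "merlin")


def make_spec(bucket, engine, engine_params=None):
    """Build an illustrative voice-building spec as a plain dict."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    return {
        # Assumed GCS layout; the file names are placeholders.
        "lexicon_path": f"gs://{bucket}/lexicon.tsv",
        "phonology_path": f"gs://{bucket}/phonology.json",
        "wavs_path": f"gs://{bucket}/wavs/",
        "engine": engine,
        "engine_params": dict(engine_params or {}),
    }


def missing_keys(spec):
    """Return the required keys that are absent from the spec."""
    return [key for key in REQUIRED_KEYS if key not in spec]


spec = make_spec("my-voice-data", "merlin", {"sample_rate": 22050})
spec_json = json.dumps(spec, indent=2, sort_keys=True)
print(spec_json)
```

The point of the sketch is the shape of the workflow: one declarative document names every input, so swapping engines or datasets means editing a few values rather than rewriting build scripts.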
Key highlights
- Ships with pre-loaded public data from Google’s language-resources repo, including a Sinhala example
- Custom data exporter hook lets you transform lexicons or filter bad data before the TTS engine sees it
- All job artifacts land in GCS buckets; the UI polls job status until deployment
- Explicitly not an official Google product—disclaimed right at the top
- Published research backing at ai.google/research/pubs/pub46977
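As a rough illustration of the kind of cleanup a custom data exporter might do before handing files to the TTS engine, the sketch below filters a pronunciation lexicon. It assumes a tab-separated `word<TAB>phonemes` format and a hypothetical `clean_lexicon` helper; neither the format nor the function name comes from Voice Builder itself.

```python
def clean_lexicon(lines):
    """Drop blank, commented, malformed, and duplicate lexicon entries.

    Assumes each valid line is 'word<TAB>phonemes' (an illustrative format).
    """
    seen = set()
    cleaned = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            continue  # skip entries without exactly one word and one pronunciation
        word = parts[0].strip().lower()
        pron = " ".join(parts[1].split())  # normalize whitespace between phonemes
        if word in seen:
            continue  # keep only the first pronunciation per word
        seen.add(word)
        cleaned.append(f"{word}\t{pron}")
    return cleaned


sample = ["Hello\th e l o", "", "bad line", "hello\th e l o", "world\t w er l d "]
print(clean_lexicon(sample))
```

A real exporter would read from and write to the job's storage location, but the filtering logic is where most of the per-language judgment calls live.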
Caveats
- Deployment is a nine-step prerequisite slog across GCP, Firebase, gcloud, and Docker; one typo in deploy.sh and you’re debugging IAM roles
- The Genomics Pipeline API dependency is a curious choice for TTS training and may date the architecture
- No screenshots provided in the repo, so you’re flying blind on UI polish
Verdict
Worth a look if you’re building voices for low-resource languages and need a shared web interface for non-technical collaborators. Skip it if you want modern neural TTS or a local, dependency-light setup.