Is voicebox-pytorch open source?

Yes — lucidrains/voicebox-pytorch is open source, released under the MIT license.

What language is voicebox-pytorch written in?

lucidrains/voicebox-pytorch is primarily written in Python.

How popular is voicebox-pytorch?

lucidrains/voicebox-pytorch has 699 stars on GitHub.

Where can I find voicebox-pytorch?

lucidrains/voicebox-pytorch is on GitHub at https://github.com/lucidrains/voicebox-pytorch.

lucidrains/voicebox-pytorch

A PyTorch Voicebox that fixes Meta’s paper, then tells you to skip it

It rebuilds Meta’s Voicebox in PyTorch, corrects a few architectural oversights, and then tells you to use E2 TTS instead.

★ 699 stars | Python | Image · Video · Audio | Language Models

What it does

This is a PyTorch port of Voicebox, MetaAI’s text-to-speech system built on continuous normalizing flows and conditional flow matching. It trains a neural network to generate audio tokens guided by text or semantic conditioning; at sampling time, an ODE solver integrates the learned flow to map noise to speech. The repo wraps the VoiceBox module in a ConditionalFlowMatcherWrapper and can chain into Spear-TTS for semantic token conditioning.
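To make the flow-matching idea concrete, here is a minimal sketch in plain Python, not the repository's code. It assumes the optimal-transport probability path used in the flow-matching literature, where the interpolated point is x_t = (1 - (1 - sigma_min) t) x0 + t x1 and the regression target is x1 - (1 - sigma_min) x0. The names `cfm_training_pair` and `euler_sample`, the `SIGMA_MIN` value, and the fixed-step Euler integrator are all illustrative assumptions; the repo delegates integration to torchdiffeq or torchode.

```python
SIGMA_MIN = 1e-5  # assumed small constant, chosen here only for illustration


def cfm_training_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, target_velocity) on the optimal-transport path.

    x0 is a noise sample, x1 is a data sample, t is a time in [0, 1].
    A network would be trained to predict target_velocity from (x_t, t).
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, target


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler.

    This is a stand-in for an adaptive ODE solver, used only to keep the
    sketch dependency-free.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

For a single fixed noise/data pair the target velocity is constant in t, so integrating that oracle velocity from the noise sample lands on the data sample (up to a sigma_min-sized offset). A trained network replaces the oracle with a learned, conditioned prediction.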

The interesting bit

The author quietly fixes two paper-level oversights: swapping ALiBi for rotary embeddings (noting that ALiBi cannot be applied straightforwardly to bidirectional models) and replacing the authors’ time-embedding concatenation with adaptive RMSNorm, borrowing from the Paella paper. That level of surgical correction is typical of lucidrains repos, but the blunt “just use E2 TTS” disclaimer in the README is less typical and refreshingly honest.
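As a rough illustration of why rotary embeddings suit a bidirectional model, the sketch below (plain Python, not taken from the repo) rotates adjacent pairs of vector components by a position-dependent angle. The `rotate` and `dot` helpers, the base of 10000, and the adjacent-pair layout are assumptions made for the example; real implementations often pair dimensions differently. The property to notice is that the query-key score depends only on the signed offset between positions, so looking forward and looking backward give different scores.

```python
import math


def rotate(vec, pos, base=10000.0):
    """Apply a rotary position embedding to an even-length vector at position pos."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each adjacent pair gets its own frequency, decreasing with i.
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def dot(u, v):
    """Plain dot product, standing in for an attention score."""
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 0.0, 0.5, -0.5]
k = [0.0, 1.0, 0.25, 0.75]

# Same signed offset (+4) at two different absolute positions.
score_a = dot(rotate(q, 3), rotate(k, 7))
score_b = dot(rotate(q, 13), rotate(k, 17))

# Opposite offset (-4): the key now sits to the left of the query.
score_back = dot(rotate(q, 7), rotate(k, 3))
```

Here `score_a` and `score_b` agree up to floating-point error, while `score_back` differs, which is the direction sensitivity a non-causal attention layer needs.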

Key highlights

Implements conditional flow matching with torchdiffeq and torchode backends for the generative ODE.
Integrates with Spear-TTS semantic tokens and Encodec/Voco audio codec for end-to-end raw-audio training.
Replaces the paper’s time embedding with adaptive RMSNorm; uses rotary position embeddings instead of ALiBi.
Includes an Accelerate-based trainer; Lucas Newman contributed Spear-TTS conditioning code and reports convergence better than SoundStorm.
Supports both text-conditional and unconditional training modes.
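
The adaptive RMSNorm mentioned above can be sketched in a few lines of plain Python. This is an assumption-laden illustration of the general technique (normalize, then scale and shift with values predicted from a conditioning vector such as a time embedding), not the repository's module. The helper names, the `eps` value, and the identity-style initialization shown in the usage lines are all hypothetical.

```python
import math


def rms_norm(x, eps=1e-8):
    """Scale x so that its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / (rms + eps) for v in x]


def linear(weights, bias, cond):
    """Plain affine map: one output per weight row."""
    return [sum(w * c for w, c in zip(row, cond)) + b
            for row, b in zip(weights, bias)]


def adaptive_rms_norm(x, cond, gamma_w, gamma_b, beta_w, beta_b):
    """Normalize x, then scale and shift it with values predicted from cond."""
    gamma = linear(gamma_w, gamma_b, cond)
    beta = linear(beta_w, beta_b, cond)
    return [n * g + b for n, g, b in zip(rms_norm(x), gamma, beta)]


x = [3.0, 4.0]
cond = [2.0]  # stand-in for a time embedding

# Zero weights with gamma bias 1 and beta bias 0: conditioning has no effect yet.
identity_out = adaptive_rms_norm(x, cond, [[0.0], [0.0]], [1.0, 1.0],
                                 [[0.0], [0.0]], [0.0, 0.0])

# A non-zero gamma weight lets the condition rescale the first feature.
scaled_out = adaptive_rms_norm(x, cond, [[1.0], [0.0]], [1.0, 1.0],
                               [[0.0], [0.0]], [0.0, 0.0])
```

Because the condition enters through the normalization's scale and shift rather than as an extra element in the sequence, it is not subject to positional encoding.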

Caveats

The author explicitly recommends using the E2 TTS repository instead of this one.
The MelVoco encoder currently reconstructs audio at an incorrect length, and the NS2 aligner class still needs cleanup before duration-predictor training is ready.
Several TODOs remain unfinished, including mapping audio frames to seconds for sampling.

Verdict

Tinkerers who want to study flow-matching TTS or need a hackable PyTorch reference to Meta’s architecture should pull this apart. Anyone looking for a maintained, production-ready speech pipeline should take the author’s advice and look at E2 TTS instead.
