Is VoiceCraft open source?

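Yes. jasonppy/VoiceCraft is publicly available on GitHub and tracked on heatdrop, but its code and model weights are under non-commercial licenses (CC BY-NC-SA 4.0 and Coqui Public Model License 1.0.0).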

What language is VoiceCraft written in?

GitHub classifies jasonppy/VoiceCraft as primarily Jupyter Notebook; its standalone demo scripts (tts_demo.py, speech_editing_demo.py) are Python.

How popular is VoiceCraft?

jasonppy/VoiceCraft has 8.5k stars on GitHub.

Where can I find VoiceCraft?

jasonppy/VoiceCraft is on GitHub at https://github.com/jasonppy/VoiceCraft.

jasonppy/VoiceCraft

Clone a voice from seconds of audio, then edit what it says

VoiceCraft exists to remove the recording booth from speech editing and voice cloning: it can edit what someone says, or copy their voice, from just a few seconds of raw audio.

★ 8.5k stars | Jupyter Notebook | Image · Video · Audio | Inference · Serving

What it does

VoiceCraft is a token infilling neural codec language model that performs zero-shot text-to-speech and speech editing on real-world audio like podcasts, audiobooks, and internet videos. Given just a few seconds of reference speech from an unseen speaker, it can clone the voice or edit specific words in an existing recording without re-recording. It treats speech as discrete codec tokens and uses infilling to regenerate or insert audio that matches the target text while preserving the original speaker.

The interesting bit

Instead of generating waveforms directly, VoiceCraft works on compressed neural codec representations, essentially treating audio the way a language model treats text tokens. The infilling mechanism lets you surgically replace words in an utterance while keeping the surrounding audio and speaker identity intact, which is harder than it sounds when the source is a noisy YouTube video.
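
To make the infilling idea concrete, here is a toy sketch of one common way to let a left-to-right model fill a gap: move the span to be replaced to the end of the sequence, leaving a mask marker behind, so the model sees both the left and right context before it generates. This is an illustration only, not VoiceCraft's code; the `<M>` marker, the function names, and the string "tokens" are all hypothetical stand-ins for real codec tokens.

```python
from typing import List, Tuple

MASK = "<M>"  # hypothetical mask marker, not a real VoiceCraft symbol


def rearrange_for_infilling(
    tokens: List[str], start: int, end: int
) -> Tuple[List[str], List[str]]:
    """Cut tokens[start:end] out and mark the gap.

    Returns (context, target): the context holds left context, a mask,
    right context, and a trailing mask that signals where generation
    begins; the target is the span a model would be trained to produce.
    """
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span out of range")
    context = tokens[:start] + [MASK] + tokens[end:] + [MASK]
    target = tokens[start:end]
    return context, target


def restore(context: List[str], generated: List[str]) -> List[str]:
    """Splice generated tokens back into the position of the first mask."""
    first = context.index(MASK)
    body = context[:-1]  # drop the trailing mask, it only marks the start
    return body[:first] + generated + body[first + 1:]


codec_tokens = ["a1", "a2", "a3", "a4", "a5", "a6"]
context, target = rearrange_for_infilling(codec_tokens, 2, 4)
# context is ['a1', 'a2', '<M>', 'a5', 'a6', '<M>'], target is ['a3', 'a4']

# Pretend a model generated three new tokens for the gap.
edited = restore(context, ["b1", "b2", "b3"])
# edited is ['a1', 'a2', 'b1', 'b2', 'b3', 'a5', 'a6']
```

Note that the replacement does not need to be the same length as the span it replaces, which is what allows an edit to insert or shorten words while the surrounding tokens stay untouched.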

Key highlights

Zero-shot voice cloning and speech editing from only a few seconds of reference audio
Trained on diverse “in-the-wild” data including podcasts, internet videos, and audiobooks
Available in 330M and 830M parameter variants, with TTS-enhanced finetunes hosted on HuggingFace
Multiple inference paths: HuggingFace Spaces, Google Colab, Docker, standalone CLI scripts (tts_demo.py, speech_editing_demo.py), and local Gradio
A change in default sampling from top-p (topp=1) to top-k (topk=40) reportedly improved editing and TTS output significantly
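
The sampling change in the last highlight can be illustrated with a small sketch. With top-p set to 1 the sampler keeps the whole distribution, including the long tail of unlikely tokens; with top-k it keeps only the k most probable ones. The functions below are a generic, standard-library illustration of the two filters under that reading, not VoiceCraft's implementation, and the four-token distribution is made up for the example.

```python
import random
from typing import Dict


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest high-probability set whose mass reaches p."""
    kept = []
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token according to the (filtered) probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Made-up next-token distribution for illustration.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

everything = top_p_filter(probs, 1.0)  # top-p = 1 filters nothing out
head_only = top_k_filter(probs, 2)     # top-k = 2 keeps only "a" and "b"

rng = random.Random(0)
token = sample(head_only, rng)         # always "a" or "b"
```

In a real codec vocabulary the tail contains many tokens, so cutting it with top-k removes the chance of occasionally sampling a very unlikely token, which is one plausible reason the change helped.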

Caveats

Efficiency improvements are still on the maintainers’ TODO list
Some finetuned models carry a hard 16-second ceiling on prompt plus generation length due to training compute constraints
Code and model weights are under non-commercial licenses (CC BY-NC-SA 4.0 and Coqui Public Model License 1.0.0)
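
The 16-second ceiling is simple arithmetic, but it is easy to overlook when picking a reference clip. The sketch below assumes the limit applies to prompt duration plus generated duration, as the caveat states; the helper names are hypothetical and are not part of VoiceCraft.

```python
MAX_TOTAL_SECONDS = 16.0  # ceiling quoted in the caveat for some finetuned models


def generation_budget(
    prompt_seconds: float, max_total: float = MAX_TOTAL_SECONDS
) -> float:
    """Seconds left for generated speech once the prompt is counted."""
    if prompt_seconds < 0:
        raise ValueError("prompt_seconds must be non-negative")
    return max(0.0, max_total - prompt_seconds)


def fits_ceiling(
    prompt_seconds: float,
    generated_seconds: float,
    max_total: float = MAX_TOTAL_SECONDS,
) -> bool:
    """True if prompt plus generation stays within the ceiling."""
    return prompt_seconds + generated_seconds <= max_total


budget = generation_budget(4.0)   # a 4-second prompt leaves 12 seconds
ok = fits_ceiling(4.0, 10.0)      # 14 seconds total, within the limit
too_long = fits_ceiling(6.0, 11.0)  # 17 seconds total, over the limit
```

A shorter reference clip leaves more room for generated speech, so trimming the prompt is the first thing to try when a target sentence does not fit.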

Verdict

Worth exploring if you need research-grade voice cloning or speech editing for non-commercial prototypes. Skip it if you need a commercially licensed, production-ready, or long-form TTS engine without runtime limits.