Is soundstorm-pytorch open source?

Yes — lucidrains/soundstorm-pytorch is open source, released under the MIT license.

What language is soundstorm-pytorch written in?

lucidrains/soundstorm-pytorch is primarily written in Python.

How popular is soundstorm-pytorch?

lucidrains/soundstorm-pytorch has 1.5k stars on GitHub.

Where can I find soundstorm-pytorch?

lucidrains/soundstorm-pytorch is on GitHub at https://github.com/lucidrains/soundstorm-pytorch.

lucidrains/soundstorm-pytorch

SoundStorm in PyTorch: parallel audio by masking SoundStream tokens

This repo implements Google DeepMind’s SoundStorm, applying MaskGiT’s parallel demasking strategy to SoundStream residual vector-quantized codes for fast, non-autoregressive audio synthesis.

★ 1.5k stars | Python | Image · Video · Audio | ML Frameworks

What it does

The project is a PyTorch implementation of SoundStorm, an efficient non-autoregressive audio generator from Google DeepMind. It takes residual vector-quantized (RVQ) token sequences from a SoundStream codec and iteratively demasks them with a Conformer network; the predicted codes can then be decoded back to raw audio by the codec. When wired to a trained TextToSemantic transformer, it can function as a text-to-speech pipeline, though that upstream component remains a work in progress.
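
To make "residual vector-quantized" concrete, here is a minimal pure-Python sketch of the idea: each quantizer level encodes whatever the previous levels failed to capture. The hand-written codebooks and the helper names (`nearest`, `rvq_encode`, `rvq_decode`) are assumptions for illustration only; they are not SoundStream's trained codebooks or this repository's API.

```python
# Toy residual vector quantization (RVQ). Codebooks are made up by hand.

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Return one code index per quantizer level."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # The next level only sees what this level could not represent.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Sum the selected entries across levels to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Two hypothetical levels: a coarse codebook, then a finer one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

frame = [1.2, 0.9]                      # one (made-up) latent audio frame
codes = rvq_encode(codebooks, frame)    # one token per quantizer level
approx = rvq_decode(codebooks, codes)   # reconstruction from the tokens
```

A real codec produces one such stack of codes per audio frame, so a clip becomes a grid of shape (frames, quantizer levels), which is the kind of input the model demasks.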

The interesting bit

The clever twist is borrowing MaskGiT’s image-generation playbook for audio: instead of sampling tokens left to right, the model starts with a fully masked sequence and, over a fixed schedule (18 steps by default), commits its most confident predictions while leaving the rest masked for later passes. Inside the Conformer, it replaces Shaw’s relative positional embeddings with rotary embeddings and enables Flash Attention by default.
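
The demasking loop described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the repository's code: `toy_predict` is a hypothetical stand-in for the Conformer that returns random tokens and confidences, the cosine schedule is one common choice for how many positions stay masked per step, and committed tokens are never revisited.

```python
import math
import random

MASK = -1  # sentinel for a still-masked position

def toy_predict(tokens, rng):
    """Stand-in for the network: a (token, confidence) guess per position."""
    return [(rng.randrange(1024), rng.random()) for _ in tokens]

def demask(seq_len, steps=18, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * seq_len          # start fully masked
    masked_per_step = []
    for step in range(steps):
        # Cosine schedule: fraction of positions left masked after this step.
        frac_masked = math.cos(math.pi / 2 * (step + 1) / steps)
        target_masked = int(frac_masked * seq_len)

        preds = toy_predict(tokens, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]

        # Commit the most confident predictions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        num_to_commit = max(len(masked) - target_masked, 0)
        for i in masked[:num_to_commit]:
            tokens[i] = preds[i][0]

        masked_per_step.append(sum(t == MASK for t in tokens))
    return tokens, masked_per_step

tokens, masked_per_step = demask(seq_len=64)
```

Every position is predicted in parallel at each step, so the cost is a fixed number of forward passes regardless of sequence length, rather than one pass per token.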

Key highlights

Trains on pre-encoded SoundStream RVQ codes or raw audio passed through an integrated codec.
Uses Conformer blocks with rotary position embeddings and Flash Attention enabled by default.
Supports variable-length sequences via masking and grouped RVQ via concatenated embeddings.
Includes acoustic prompting and per-level quantizer decoding for finer control.
Ships with a Hugging Face Accelerate-based trainer and has been verified end-to-end by community contributors.
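
The rotary position embeddings mentioned above encode position by rotating pairs of feature dimensions, so that attention scores depend only on the offset between two positions. The sketch below illustrates that property in plain Python; the function names, the base of 10000, and the adjacent-pair layout are assumptions for illustration and may differ from how the repository arranges dimensions.

```python
import math

def rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair of dimensions rotates at its own frequency.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # made-up query vector
k = [1.1, 0.4, -0.6, 0.9]   # made-up key vector

# Same offset (3) at different absolute positions gives the same score.
score_near = dot(rotate(q, 5), rotate(k, 2))
score_far = dot(rotate(q, 105), rotate(k, 102))
```

Because the rotation is applied to queries and keys directly, no separate table of relative-position embeddings is needed.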

Caveats

Full text-to-speech is blocked by unfinished upstream work: spear-tts-pytorch still needs pretraining, pseudo-labeling, and backtranslation logic.
The todo list shows that cross-attention, adaptive layernorm conditioning, and a command-line interface are not yet implemented.
The repository is explicitly labeled a work-in-progress.

Verdict

Audio researchers and TTS hackers who want a non-autoregressive, parallel decoder to pair with their own SoundStream or semantic encoder should look here. If you need a polished, batteries-included speech synthesis product, this is still too early.
