GANs that learn to sing, drum, and chirp in raw waveforms
WaveGAN treats audio generation like image synthesis, but skips the spectrogram middleman and works directly on one-dimensional waveforms.

What it does WaveGAN trains generative adversarial networks to synthesize raw audio waveforms—speech, drums, birdsong, piano—by learning from folders of MP3s, WAVs, or OGGs without preprocessing. The repo includes both the waveform approach and a SpecGAN baseline that generates spectrograms instead. You can train on arbitrary audio files, adjust sample rates, handle multi-channel audio, and generate clips up to 4 seconds at 16kHz.
The interesting bit Most audio GANs operate on 2D spectrograms because image-generation techniques transfer cleanly. WaveGAN keeps everything in the time domain, using one-dimensional convolutions analogous to DCGAN’s image approach. The v2 update added streaming data loading and longer generation lengths in response to what the authors call “common requests”—a polite way of saying the first version was a bit rough for real use.
Key highlights
- Train directly on audio files; no preprocessing required (v2 improvement)
- Supports variable sample rates, multi-channel audio, and clip lengths up to 65,536 samples
- Includes JavaScript generation code for browser-based demos (see the procedural drum machine)
- TensorBoard monitoring, automatic checkpoint backups, and Inception score evaluation for SC09
- Colab notebook provided for inference
Caveats
- Locked to TensorFlow 1.12.0 and Python 3; this is legacy code now
- Single-GPU training only; CPU training is “prohibitively slow and not recommended”
- GAN training “may occasionally collapse”—hence the hourly backup script
- SpecGAN comparison is limited to 1-second clips at 16kHz
Verdict Worth exploring if you’re researching generative audio or need a reproducible baseline for raw-waveform synthesis. Skip it if you want modern PyTorch, stable diffusion-based audio, or anything approaching production reliability.