kuleshov/audio-super-res

Neural networks that hallucinate missing audio frequencies

A research implementation for upsampling low-resolution audio using temporal feature-wise modulation, with a Keras layer you can steal for other time-series work.

1.3k stars · Python · Image · Video · Audio

What it does

This repo trains neural networks to reconstruct high-resolution audio from downsampled inputs—essentially teaching a model to guess the frequencies that were thrown away. It ships with data pipelines for the VCTK speech corpus, training scripts for single- or multi-speaker datasets, and a pre-trained checkpoint for speaker #1. The run.py script handles both training and inference, spitting out side-by-side low-res, high-res, and “predicted” WAV files.

The interesting bit

The authors’ Temporal FiLM (Feature-wise Linear Modulation) layer is the real takeaway: it captures long-range dependencies in sequences by modulating features across time, not just depth. They’ve packaged it as a standalone Keras layer (keras_layer.py), and the same architecture has been repurposed for EEG denoising and functional genomics imputation. The audio task is essentially a demo.
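
To make the idea concrete, here is a minimal NumPy sketch of temporal feature-wise modulation. It is not the code in keras_layer.py. The function name `tfilm_sketch`, the weight matrices, and the plain tanh recurrence (standing in for a trained recurrent layer such as an LSTM) are all assumptions for illustration. The sequence is split into blocks, each block is max-pooled, a recurrence runs over the pooled blocks, and its state rescales every feature in the matching block.

```python
import numpy as np

def tfilm_sketch(x, block_size, w_in, w_rec):
    """Illustrative temporal feature-wise modulation (hypothetical helper).

    x:      (T, C) feature map, with T divisible by block_size
    w_in:   (C, C) input weights of a toy tanh recurrence
    w_rec:  (C, C) recurrent weights of the same recurrence
    """
    T, C = x.shape
    n_blocks = T // block_size
    blocks = x.reshape(n_blocks, block_size, C)

    # Summarize each block along time with max pooling -> (n_blocks, C)
    pooled = blocks.max(axis=1)

    # Run a recurrence over the block summaries; the hidden state at
    # block b depends on every earlier block, which is where the
    # long-range information comes from.
    h = np.zeros(C)
    scales = np.empty((n_blocks, C))
    for b in range(n_blocks):
        h = np.tanh(pooled[b] @ w_in + h @ w_rec)
        scales[b] = h  # one multiplicative scale per channel per block

    # Modulate: every time step in a block shares that block's scales.
    # Only a scale is applied here; a shift term could be added the same way.
    return (blocks * scales[:, None, :]).reshape(T, C)
```

Because the scale for a late block is computed from a state that has seen all earlier blocks, a change at the start of the sequence can alter features at the end, without the cost of running a recurrence over every time step.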

Key highlights

  • Four model variants: audiounet, audiotfilm (the one the authors report as best), dnn, and a cubic-spline baseline
  • Single-speaker training takes “a few hours”; multi-speaker needs “several days”
  • Pre-trained single-speaker model available via Google Drive link
  • Input length must be a multiple of 2**layers; the model will silently clip your audio if it isn’t
  • Includes a grocery-sales imputation experiment, because why not
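
One way to sidestep the silent clipping mentioned above is to pad the input yourself before inference. The helper below is a sketch, not part of the repo: `pad_to_multiple` and its arguments are hypothetical names, and it assumes zero-padding at the end is acceptable for your audio.

```python
def pad_to_multiple(samples, n_layers):
    """Zero-pad a 1-D sequence so its length is a multiple of 2**n_layers.

    Returns the padded list and the number of samples appended, so the
    caller can trim the same amount from the model output afterwards.
    """
    multiple = 2 ** n_layers
    remainder = len(samples) % multiple
    if remainder == 0:
        return list(samples), 0
    pad = multiple - remainder
    return list(samples) + [0.0] * pad, pad


# Example: 10 samples, 3 layers -> padded up to the next multiple of 8
padded, pad = pad_to_multiple([0.1] * 10, n_layers=3)
print(len(padded), pad)  # 16 6
```

Keeping track of `pad` matters: trimming that many samples from the end of the prediction restores the original duration instead of leaving trailing silence.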

Caveats

  • The authors explicitly warn that the codebase “has not been fully tested” after a recent TensorFlow/Keras upgrade
  • Performance is highly sensitive to how you generate low-res training data: the choice between Butterworth and Chebyshev low-pass filters matters, and aliased input (no filter) actually sounds better despite worse objective metrics
  • Applying this to your own voice requires collecting matching labeled examples; the pre-trained model is speaker-specific
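
The filter sensitivity noted above is easier to see with the three options side by side. This is a sketch using SciPy, not the repo's data pipeline. The function name `make_low_res`, the filter orders, the ripple, and the cutoffs are assumptions chosen for illustration, and `ratio` is assumed to be an integer of 2 or more.

```python
import numpy as np
from scipy import signal

def make_low_res(x, ratio, mode="butter"):
    """Downsample x by an integer ratio with a chosen anti-aliasing strategy."""
    if mode == "butter":
        # 8th-order Butterworth, cutoff at the new Nyquist frequency (assumed settings)
        sos = signal.butter(8, 1.0 / ratio, btype="low", output="sos")
    elif mode == "cheby":
        # 8th-order Chebyshev type I with 0.05 dB passband ripple (assumed settings)
        sos = signal.cheby1(8, 0.05, 0.8 / ratio, btype="low", output="sos")
    elif mode == "none":
        # No filter: content above the new Nyquist frequency folds back (aliasing)
        return np.asarray(x)[::ratio]
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Zero-phase filtering, then keep every ratio-th sample
    filtered = signal.sosfiltfilt(sos, x)
    return filtered[::ratio]
```

A model trained on one of these variants sees a different kind of low-res input than the other two produce, so the training-time and inference-time settings should match.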

Verdict

Worth a look if you need a proven time-series upsampling architecture you can adapt, or if you’re curious about FiLM-like conditioning for sequences. Skip it if you want a polished, drop-in audio enhancer—this is research code with sharp edges.
