Neural networks that hallucinate missing audio frequencies
A research implementation for upsampling low-resolution audio using temporal feature-wise modulation, with a Keras layer you can steal for other time-series work.

What it does
This repo trains neural networks to reconstruct high-resolution audio from downsampled inputs, essentially teaching a model to guess the frequencies that were thrown away. It ships with data pipelines for the VCTK speech corpus, training scripts for single- or multi-speaker datasets, and a pre-trained checkpoint for speaker #1. The run.py script handles both training and inference; at inference time it spits out side-by-side low-res, high-res, and “predicted” WAV files.
The interesting bit
The authors’ Temporal FiLM (Feature-wise Linear Modulation) layer is the real takeaway: it captures long-range dependencies in sequences by modulating features across time, not just depth. They’ve packaged it as a standalone Keras layer (keras_layer.py), and the same architecture has been repurposed for EEG denoising and functional genomics imputation. The audio task is essentially a demo.
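To make "modulating features across time" a bit more concrete, here is a rough pure-Python sketch of the general idea as I read it: chop the sequence into blocks, summarise each block, let a recurrent state flow across those summaries, and rescale each block by what comes out. This is not the code in keras_layer.py. The function name, block_size, and decay are hypothetical, and a leaky running average stands in for the trained recurrent network a real layer would use.

```python
import math


def temporal_modulation_sketch(features, block_size, decay=0.5):
    """Scale each block of time steps by a value that depends on earlier blocks.

    features: list of T time steps, each a list of C channel activations.
    block_size and decay are hypothetical knobs for this sketch only.
    """
    n_steps = len(features)
    if n_steps == 0 or n_steps % block_size != 0:
        raise ValueError("sequence length must be a positive multiple of block_size")
    n_channels = len(features[0])
    n_blocks = n_steps // block_size

    # 1. Summarise each block: max over time, separately for every channel.
    pooled = []
    for b in range(n_blocks):
        block = features[b * block_size:(b + 1) * block_size]
        pooled.append([max(step[c] for step in block) for c in range(n_channels)])

    # 2. Carry state from block to block. ASSUMPTION: a leaky running average
    #    stands in for the recurrent network that a real layer would train.
    state = [0.0] * n_channels
    scales = []
    for summary in pooled:
        state = [decay * s + (1.0 - decay) * p for s, p in zip(state, summary)]
        scales.append([1.0 / (1.0 + math.exp(-s)) for s in state])  # squash to (0, 1)

    # 3. Modulate: multiply every time step by its block's per-channel scale.
    return [
        [value * scales[t // block_size][c] for c, value in enumerate(step)]
        for t, step in enumerate(features)
    ]


# Four time steps, one channel, two blocks of two steps each.
modulated = temporal_modulation_sketch([[1.0], [1.0], [1.0], [1.0]], block_size=2)
```

The point of step 2 is that the scale applied to a late block depends on every block before it, which is how a cheap per-block multiplier can carry long-range context that a fixed-width convolution would miss.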
Key highlights
- Four model variants: audiounet, audiotfilm (the authors’ pick for best), dnn, and a cubic-spline baseline
- Single-speaker training takes “a few hours”; multi-speaker needs “several days”
- Pre-trained single-speaker model available via Google Drive link
- Input length must be a multiple of 2**layers; the model will silently clip your audio if it isn’t
- Includes a grocery-sales imputation experiment, because why not
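The multiple-of-2**layers rule is easy to deal with before the model gets a chance to clip anything. A minimal sketch, assuming zero-padding at the end is acceptable for your audio; pad_to_multiple and its parameters are hypothetical names, not something the repo provides.

```python
def pad_to_multiple(samples, layers, pad_value=0.0):
    """Pad a list of samples so its length is a multiple of 2**layers.

    Returns the padded list and the number of samples added, so the
    caller can trim the same amount later.
    """
    multiple = 2 ** layers
    remainder = len(samples) % multiple
    if remainder == 0:
        return list(samples), 0
    n_pad = multiple - remainder
    return list(samples) + [pad_value] * n_pad, n_pad


# With layers=3 the length must be a multiple of 8, so 10 samples grow to 16.
padded, n_pad = pad_to_multiple([0.1] * 10, layers=3)
```

Padding keeps the tail of the clip that clipping would drop; n_pad tells you how much to cut off again afterwards.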
Caveats
- The authors explicitly warn that the codebase “has not been fully tested” after a recent TensorFlow/Keras upgrade
- Performance is highly sensitive to how you generate low-res training data: the choice between Butterworth and Chebyshev low-pass filters matters, and aliased input (no filter) actually sounds better despite worse objective metrics
- Applying this to your own voice requires collecting matching labeled examples; the pre-trained model is speaker-specific
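For a feel of what the low-pass filter caveat means in practice, here is a small SciPy sketch of three ways to produce low-res audio: Butterworth, Chebyshev type I, or no filter at all. This is not the repo's data pipeline. make_low_res, the filter order of 8, the 0.05 dB ripple, and the zero-phase filtering are all assumptions picked for illustration.

```python
import numpy as np
from scipy import signal


def make_low_res(audio, ratio, method="butter", order=8):
    """Keep every `ratio`-th sample, with an optional anti-aliasing filter.

    method: "butter", "cheby1", or "none" (no filter, so the result is aliased).
    """
    if ratio < 2:
        raise ValueError("ratio must be an integer >= 2")
    if method == "none":
        filtered = np.asarray(audio, dtype=float)
    else:
        # Cutoff as a fraction of the original Nyquist frequency:
        # 1/ratio is the Nyquist frequency of the downsampled signal.
        cutoff = 1.0 / ratio
        if method == "butter":
            sos = signal.butter(order, cutoff, btype="low", output="sos")
        elif method == "cheby1":
            # 0.05 dB passband ripple is an assumed value for this sketch.
            sos = signal.cheby1(order, 0.05, cutoff, btype="low", output="sos")
        else:
            raise ValueError(f"unknown method: {method!r}")
        # Zero-phase filtering (forward and backward pass).
        filtered = signal.sosfiltfilt(sos, audio)
    return filtered[::ratio]


# One second of a 5 kHz tone at 16 kHz, downsampled 4x to 4 kHz.
rate = 16000
tone = np.sin(2 * np.pi * 5000 * np.arange(rate) / rate)
aliased = make_low_res(tone, 4, method="none")
butter_lr = make_low_res(tone, 4, method="butter")
cheby_lr = make_low_res(tone, 4, method="cheby1")
```

At the new 4 kHz rate a 5 kHz tone cannot be represented, so without a filter it folds down to 1 kHz and stays in the signal, while either filter removes it almost entirely. Whichever variant you train on, the model only ever sees that flavour of low-res input.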
Verdict
Worth a look if you need a proven time-series upsampling architecture you can adapt, or if you’re curious about FiLM-like conditioning for sequences. Skip it if you want a polished, drop-in audio enhancer: this is research code with sharp edges.