A VAE that actually scales to 256×256 faces without melting
NVIDIA's NeurIPS 2020 spotlight paper fixes the main reason variational autoencoders fall apart when you make them deep.

What it does
NVAE is a deep hierarchical variational autoencoder that trains likelihood-based generative models on images from MNIST up to 256×256 faces. The repo contains the official PyTorch implementation with training scripts for six datasets and a small zoo of hyperparameter presets.
The interesting bit
Deep hierarchical VAEs tend to become unstable to train as you add more groups of latent variables, largely because the KL terms in the objective are hard to keep bounded. NVAE keeps the deep hierarchy and makes it trainable with some architectural elbow grease: normalizing flows in the encoder, a residual parameterization of the approximate posterior, spectral regularization, and a carefully designed encoder-decoder structure that keeps training stable even at depth.
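To make the "residual distributions" idea concrete, here is a minimal sketch for a single scalar latent, assuming the posterior is expressed as an offset from the prior: q = N(mu_p + delta_mu, (sigma_p * delta_sigma)^2). The function names and numeric values are hypothetical and this is not the repo's code; it only checks that the KL term written in terms of the offsets matches the textbook Gaussian KL.

```python
import math

def kl_normal(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)) for scalar Gaussians, in nats."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

def kl_residual(delta_mu, delta_sigma, sigma_p):
    """Same KL when the posterior is parameterized relative to the prior:
    q = N(mu_p + delta_mu, (sigma_p * delta_sigma)^2)."""
    return 0.5 * (delta_mu ** 2 / sigma_p ** 2
                  + delta_sigma ** 2
                  - math.log(delta_sigma ** 2)
                  - 1.0)

# Hypothetical values for one latent dimension.
mu_p, sigma_p = 0.3, 2.0          # prior mean and standard deviation
delta_mu, delta_sigma = 0.5, 0.8  # posterior offsets predicted by the encoder

direct = kl_normal(mu_p + delta_mu, sigma_p * delta_sigma, mu_p, sigma_p)
residual = kl_residual(delta_mu, delta_sigma, sigma_p)
print(direct, residual)  # the two values agree up to floating-point error
```

In this form the KL depends on the prior only through sigma_p, and it is exactly zero when the encoder predicts no change (delta_mu = 0, delta_sigma = 1), which is one intuition for why the parameterization is easier to optimize at depth.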
Key highlights
- Reproduces Table 1 from the paper: exact training commands for MNIST, CIFAR-10, CelebA 64, ImageNet 32×32, CelebA-HQ 256, and FFHQ 256
- Multi-node training via mpirun for the larger models (up to 24 V100s)
- LMDB conversion scripts provided for I/O efficiency on large datasets
- Smaller model variants give up ~0.01 bits per dimension (bpd) in exchange for fitting on 8 GPUs instead of 24
- PyTorch 1.6.0, Python 3.7
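For a sense of scale on the bpd figure above: bits per dimension is the negative log-likelihood bound in nats divided by the number of image dimensions and by ln 2. The sketch below uses hypothetical helper names (they are not taken from the repo) and assumes a 256×256 RGB image to show what a 0.01 bpd gap means per image.

```python
import math

def nats_to_bpd(nll_nats, height, width, channels):
    """Convert a per-image negative log-likelihood (nats) to bits per dimension."""
    num_dims = height * width * channels
    return nll_nats / (num_dims * math.log(2))

def bpd_to_nats(bpd, height, width, channels):
    """Inverse conversion: bits per dimension back to nats per image."""
    num_dims = height * width * channels
    return bpd * num_dims * math.log(2)

# A 0.01 bpd gap on a 256x256 RGB image, expressed in nats per image.
gap_nats = bpd_to_nats(0.01, 256, 256, 3)
print(f"0.01 bpd on a 256x256x3 image is about {gap_nats:.0f} nats per image")
```

Because the metric is normalized by the number of dimensions, a gap that looks tiny in bpd can still be a four-digit difference in nats per high-resolution image.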
Caveats
- Training times are substantial: from 21 hours for MNIST (2 GPUs) up to 160 hours for FFHQ 256 (24 GPUs)
- Hardware requirements are steep; the README assumes V100 clusters and mpirun familiarity
- No pre-trained checkpoints are mentioned in the README, so you're training from scratch
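To put the two quoted training times on a common footing, a quick back-of-the-envelope calculation converts them to GPU-hours. This assumes the quoted hours are wall-clock time with every GPU busy for the whole run, which the summary above does not state explicitly.

```python
# (wall-clock hours, number of GPUs) as quoted in the caveats above.
runs = {
    "MNIST": (21, 2),
    "FFHQ 256": (160, 24),
}

# GPU-hours = wall-clock hours x number of GPUs (assumes full utilization).
gpu_hours = {name: hours * gpus for name, (hours, gpus) in runs.items()}

for name, total in gpu_hours.items():
    print(f"{name}: {total} GPU-hours")
```

Under that assumption the MNIST run costs 42 GPU-hours and the FFHQ 256 run costs 3,840, which is the number to budget against if you are renting hardware.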
Verdict
Researchers working on likelihood-based generative modeling or VAE architecture design should grab this. If you need a quick off-the-shelf image generator or lack GPU clusters, look elsewhere.