CompVis/taming-transformers
Generative model for high-resolution image synthesis using VQGAN with autoregressive transformer composition.

Taming Transformers introduces a convolutional VQGAN approach that learns a codebook of context-rich visual parts, whose composition is modeled with an autoregressive transformer. The method combines the efficiency of convolutional approaches with the expressivity of transformers for high-resolution image synthesis. The repository provides training code and pretrained models for class-conditional ImageNet synthesis, achieving state-of-the-art FID scores among autoregressive approaches.