wilson1yan/VideoGPT
VideoGPT is a video generation model that uses a VQ-VAE to learn discrete latent representations of video and a GPT-like transformer to model those latents autoregressively.

VideoGPT is a likelihood-based generative model for video. It first trains a VQ-VAE, built from 3D convolutions and axial self-attention, to compress raw video into a downsampled grid of discrete latent codes. A GPT-like transformer then models these discrete latents autoregressively, using spatio-temporal position encodings. The model generates samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and produces high-fidelity natural videos from UCF-101 and the Tumblr GIF dataset (TGIF).
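
To make the two stages concrete, here is a minimal toy sketch in plain Python. It is not the repository's code: the function names `quantize` and `flatten_positions`, the squared-L2 nearest-neighbor lookup, and the time-then-height-then-width ordering are illustrative assumptions. The learned encoder (3D convolutions and axial attention) and the transformer itself are not shown; the sketch only illustrates how continuous latents become discrete tokens and how a latent grid can be laid out as a sequence with (t, h, w) positions.

```python
from itertools import product


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    Uses squared L2 distance, as in a standard VQ-VAE lookup (assumed here).
    """
    def nearest(vec):
        dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
        return dists.index(min(dists))

    return [nearest(vec) for vec in latents]


def flatten_positions(t, h, w):
    """List (time, height, width) coordinates of a latent grid in raster-scan order.

    An autoregressive model would predict one token per coordinate, in this order
    (the ordering is an assumption made for illustration).
    """
    return list(product(range(t), range(h), range(w)))


# Toy data: a 2x2x2 latent grid, 2-dimensional latents, 3-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
latents = [
    (0.1, -0.1), (0.9, 1.2), (-0.8, 0.7), (0.2, 0.1),
    (1.1, 0.9), (-1.2, 1.1), (0.0, 0.3), (0.8, 0.8),
]

tokens = quantize(latents, codebook)    # one discrete code index per grid cell
positions = flatten_positions(2, 2, 2)  # one (t, h, w) coordinate per token

for pos, tok in zip(positions, tokens):
    print(pos, tok)
```

Each printed pair is a grid coordinate and its code index. In the full model, the transformer would be trained to predict each index from the preceding ones, with the coordinate supplying the spatio-temporal position encoding.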