facebookresearch/TimeSformer
TimeSformer is a transformer-based model for video classification that uses space-time attention and reports state-of-the-art results on several action recognition benchmarks.

This repository provides the official PyTorch implementation of the TimeSformer model for video understanding. The model applies a transformer architecture directly to video: each frame is divided into non-overlapping patches, and self-attention is applied across both the spatial and temporal dimensions. Pretrained models are provided for the Kinetics-400, Kinetics-600, Something-Something-V2, and HowTo100M datasets.
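
To make the space-time attention idea concrete, the sketch below counts the patch tokens in a clip and compares how many keys each query patch is scored against under two layouts: joint attention over all patches in all frames, and a factored layout that attends over time and then over space. It is a back-of-the-envelope illustration, not code from this repository. The function names `patch_grid` and `comparisons_per_query` are hypothetical, the clip size and patch size are assumed example values, and the classification token is left out for simplicity.

```python
def patch_grid(frames, height, width, patch_size=16):
    """Return (patches_per_frame, total_tokens) for a clip.

    Hypothetical helper: patch_size=16 is an assumed example value.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("frame size must be divisible by patch_size")
    patches_per_frame = (height // patch_size) * (width // patch_size)
    return patches_per_frame, frames * patches_per_frame


def comparisons_per_query(frames, patches_per_frame):
    """Count keys each query patch is compared against (no class token).

    joint:   every patch in every frame.
    divided: the same spatial location across all frames (temporal step),
             then all patches within the query's own frame (spatial step).
    """
    joint = frames * patches_per_frame
    divided = frames + patches_per_frame
    return joint, divided


# Assumed example clip: 8 frames of 224x224 pixels, 16x16 patches.
n_per_frame, n_total = patch_grid(frames=8, height=224, width=224)
joint, divided = comparisons_per_query(frames=8, patches_per_frame=n_per_frame)
print(n_per_frame, n_total)  # 196 1568
print(joint, divided)        # 1568 204
```

Under these assumed sizes, the factored layout scores each query against 204 keys instead of 1568, which is why separating the temporal and spatial steps becomes attractive as clips get longer or frames get larger.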