raoyongming/DynamicViT
A dynamic token sparsification framework for efficient vision transformers that prunes redundant tokens to reduce FLOPs by over 30% while maintaining accuracy.

DynamicViT introduces a progressive, input-dependent token pruning mechanism for vision transformers. The framework identifies and removes redundant tokens over successive layers at inference time, which reduces FLOPs and increases throughput across several backbones, including DeiT and Swin Transformers as well as the convolutional ConvNeXt. The approach is validated on image classification and extended to object detection and semantic segmentation.
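
To give a feel for where a FLOPs reduction of this size can come from, the following is a rough back-of-the-envelope sketch, not the repository's own profiling code. It assumes a DeiT-S-like configuration (12 blocks, embedding dimension 384, 196 patch tokens plus one class token) and assumes that 70% of the remaining patch tokens are kept before blocks 3, 6, and 9. These settings and the helper names `block_macs` and `model_macs` are illustrative assumptions. Patch embedding, the classification head, and the cost of deciding which tokens to drop are ignored.

```python
def block_macs(num_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply-accumulates (MACs) for one transformer block."""
    # Linear layers: qkv (3) + output projection (1) + two MLP layers (2 * mlp_ratio).
    linear = (4 + 2 * mlp_ratio) * num_tokens * dim * dim
    # Attention: QK^T plus the attention-weighted sum over V.
    attention = 2 * num_tokens * num_tokens * dim
    return linear + attention


def model_macs(depth, dim, num_patches, keep_ratio=1.0, prune_before=()):
    """Sum block MACs when patch tokens are pruned before the given block indices."""
    total = 0
    stage = 0  # number of pruning stages applied so far
    for layer in range(depth):
        if layer in prune_before:
            stage += 1
        kept = int(num_patches * keep_ratio ** stage)
        total += block_macs(kept + 1, dim)  # +1 for the class token, which is never pruned here
    return total


# Assumed DeiT-S-like settings; the pruning schedule is a hypothetical example.
dense = model_macs(depth=12, dim=384, num_patches=196)
sparse = model_macs(depth=12, dim=384, num_patches=196,
                    keep_ratio=0.7, prune_before=(3, 6, 9))
reduction = 1 - sparse / dense
print(f"dense:  {dense / 1e9:.2f} GMACs")
print(f"sparse: {sparse / 1e9:.2f} GMACs")
print(f"reduction: {reduction:.1%}")
```

Because the linear layers dominate and scale linearly with the token count, while attention scales quadratically, dropping tokens early in the network compounds across all later blocks. Under these assumptions the estimate lands in the same range as the "over 30%" figure quoted above, though real measurements depend on the backbone and the chosen keep ratio.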