hustvl/YOLOS
A vision transformer model adapted for object detection without task-specific architectural modifications, published at NeurIPS 2021.

YOLOS demonstrates that vanilla Vision Transformers pre-trained on image classification can transfer to object detection with minimal changes: the [CLS] token is replaced by learnable detection ([DET]) tokens, and training uses a set-based loss in which predictions are matched to ground-truth objects by Hungarian (bipartite) matching. The project studies how well ImageNet-pretrained ViTs transfer to the COCO detection benchmark, including experiments with self-supervised MoCo-v3 pre-training. The implementation is also available in Hugging Face Transformers.
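
To make the set-based matching idea concrete, here is a minimal sketch, not the YOLOS implementation. It assumes boxes are plain (cx, cy, w, h) tuples and uses only an L1 box distance as the matching cost. Detection losses of this kind typically also include classification and overlap terms, which are omitted here. The names `l1_cost` and `match_predictions` are hypothetical. For readability it searches all assignments by brute force rather than running the Hungarian algorithm. Both return a minimum-cost one-to-one assignment, but brute force is only practical for a handful of boxes.

```python
from itertools import permutations


def l1_cost(pred_box, gt_box):
    """L1 distance between two boxes given as (cx, cy, w, h) tuples."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match_predictions(pred_boxes, gt_boxes):
    """Find the minimum-cost one-to-one assignment of predictions to ground truth.

    Returns (assignment, total_cost), where assignment[j] is the index of the
    prediction matched to ground-truth box j. Brute force over all candidate
    assignments stands in for the Hungarian algorithm (illustrative only).
    """
    if len(gt_boxes) > len(pred_boxes):
        raise ValueError("need at least as many predictions as ground-truth boxes")

    best_cost = float("inf")
    best_assignment = ()
    # Each candidate picks a distinct prediction index for every ground-truth box.
    for candidate in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(
            l1_cost(pred_boxes[i], gt_boxes[j]) for j, i in enumerate(candidate)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, candidate
    return list(best_assignment), best_cost


# Three predictions (one per hypothetical detection token), two ground-truth objects.
pred_boxes = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.8, 0.8, 0.3, 0.3)]
gt_boxes = [(0.8, 0.8, 0.3, 0.3), (0.1, 0.1, 0.1, 0.1)]

assignment, total = match_predictions(pred_boxes, gt_boxes)
print(assignment, total)
```

In this toy input, prediction 0 is left unmatched. In a set-prediction loss, such unmatched predictions are usually trained toward a "no object" class, so the number of detection tokens can safely exceed the number of objects in an image.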