cheerss/CrossFormer
CrossFormer++ is a vision transformer enabling cross-scale attention for object detection, instance segmentation, and semantic segmentation.

This repository contains PyTorch implementations of CrossFormer and CrossFormer++, vision transformer architectures designed to build attention across features of different scales. The core innovations include the Cross-scale Embedding Layer (CEL) and the Long-Short Distance Attention (L/SDA) module. The implementation supports multiple vision tasks: image classification, object detection with Mask-RCNN and Cascade Mask-RCNN, instance segmentation, and semantic segmentation. Pretrained models are provided in Small, Base, Large, and Huge variants.
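To make the idea behind Long-Short Distance Attention more concrete, the sketch below shows one way the two grouping schemes can be expressed on a square grid of embeddings. It is not taken from the repository's code. The function names (`sda_group_id`, `lda_group_id`, `group_members`) and the toy 4x4 grid are hypothetical, and the sketch assumes that short-distance attention groups adjacent embeddings into blocks while long-distance attention groups embeddings sampled at a fixed interval. Self-attention would then be computed within each group.

```python
# Illustrative sketch only: hypothetical helper names, not the repository's API.

def sda_group_id(row, col, group_size):
    """Short-distance grouping (assumed): adjacent group_size x group_size
    blocks of embeddings share one attention group."""
    return (row // group_size, col // group_size)


def lda_group_id(row, col, interval):
    """Long-distance grouping (assumed): embeddings sampled at a fixed
    interval share one attention group, so members are spread over the grid."""
    return (row % interval, col % interval)


def group_members(size, group_fn, param):
    """Map each group id to the (row, col) positions it contains
    on a size x size grid of embeddings."""
    groups = {}
    for row in range(size):
        for col in range(size):
            groups.setdefault(group_fn(row, col, param), []).append((row, col))
    return groups


# Toy 4x4 grid: group size 2 for short distance, interval 2 for long distance.
sda = group_members(4, sda_group_id, 2)
lda = group_members(4, lda_group_id, 2)

print(sda[(0, 0)])  # neighbouring positions in the top-left block
print(lda[(0, 0)])  # positions two steps apart, spanning the whole grid
```

In this toy setting both schemes produce four groups of four embeddings each, so the attention cost per group is the same. Only the spatial reach of each group differs: local blocks in the short-distance case, grid-spanning samples in the long-distance case.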