ViTAE-Transformer/Remote-Sensing-RVSA
A vision transformer foundation model for remote sensing image analysis achieving state-of-the-art results on object detection, semantic segmentation, and aerial scene classification benchmarks.

This repository implements a plain vision transformer adapted for remote sensing imagery, intended as a foundation model for aerial and satellite image understanding. It supports three downstream tasks: object detection in aerial images, semantic segmentation, and scene classification, on benchmark datasets such as DOTA, DIOR-R, UCM, and AID. The backbone is pretrained with self-supervision on large-scale unlabeled remote sensing data and then adapted to each downstream task through transfer learning.
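
To illustrate why a plain vision transformer needs adapting for large aerial and satellite scenes, the sketch below counts the patch tokens a plain ViT would produce for a given image size, and the number of query-key pairs full self-attention would then compare. It is a minimal sketch under stated assumptions, not code from this repository: the function names `patch_grid` and `attention_pairs`, the patch size of 16, and the example image sizes are all hypothetical choices made for illustration.

```python
from typing import Tuple


def patch_grid(height: int, width: int, patch_size: int = 16) -> Tuple[int, int, int]:
    """Return (rows, cols, tokens) when an image is cut into
    non-overlapping square patches, as a plain ViT does.

    patch_size=16 is an assumed default, not this repository's setting.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows = height // patch_size
    cols = width // patch_size
    return rows, cols, rows * cols


def attention_pairs(tokens: int) -> int:
    """Number of query-key pairs in full (global) self-attention,
    which grows quadratically with the token count."""
    return tokens * tokens


if __name__ == "__main__":
    # Hypothetical sizes: a small classification crop and a large aerial tile.
    for side in (224, 1024):
        rows, cols, tokens = patch_grid(side, side)
        print(side, rows, cols, tokens, attention_pairs(tokens))
```

With these assumed sizes, a 224 x 224 crop yields 196 tokens while a 1024 x 1024 tile yields 4096, so the attention cost grows by a factor of several hundred. That gap is one reason backbones for detection and segmentation on large remote sensing images are often modified rather than used unchanged.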