Atten4Vis/ConditionalDETR
A transformer-based object detection model whose training converges 6.7x to 10x faster than the original DETR on COCO.

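This repository implements Conditional DETR, an object detection model that modifies the cross-attention mechanism of the DETR transformer decoder. The key innovation is a conditional spatial query, learned from the decoder embedding, that lets each cross-attention head focus on a band containing a distinct region of the object, such as one extremity or the box interior. Narrowing the spatial search in this way reduces the dependence on high-quality content embeddings, which is what makes DETR slow to train. The model integrates with Hugging Face Transformers and is evaluated on COCO 2017 validation, where it converges 6.7x faster than DETR with the R50 and R101 backbones and 10x faster with the stronger DC5-R50 and DC5-R101 backbones.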
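To make the conditional spatial query more concrete, here is a minimal NumPy sketch of the idea as described in the Conditional DETR paper, not the repository's actual code. It assumes a single attention head, replaces the learned transformation T with one hypothetical linear map `W_t`, and uses a simplified sinusoidal embedding; every function and variable name below is illustrative. The spatial query is formed as p_q = T(f) ⊙ p_s, where f is the decoder embedding and p_s embeds a reference point, and queries and keys are built by concatenating content and spatial parts so that the attention logit splits into a content term plus a spatial term.

```python
import numpy as np

def sine_embed(xy, dim):
    """Map normalized 2D points, shape (n, 2), to sinusoidal embeddings, shape (n, dim)."""
    half = dim // 2  # channels per coordinate; dim is assumed to be even
    idx = np.arange(half)
    freqs = 10000.0 ** (2 * (idx // 2) / half)
    angles = 2 * np.pi * xy[:, :, None] / freqs          # (n, 2, half)
    emb = np.where(idx % 2 == 0, np.sin(angles), np.cos(angles))
    return emb.reshape(len(xy), 2 * half)

def conditional_spatial_query(f, s, W_t):
    """p_q = T(f) * p_s, with T simplified to one linear map W_t (an assumption)."""
    ref = 1.0 / (1.0 + np.exp(-s))                       # sigmoid -> normalized reference point
    p_s = sine_embed(ref, f.shape[1])                    # embedding of the reference point
    return (f @ W_t) * p_s                               # element-wise scaling of p_s

def cross_attention_weights(c_q, p_q, c_k, p_k):
    """Softmax attention over concatenated (content, spatial) queries and keys."""
    q = np.concatenate([c_q, p_q], axis=1)
    k = np.concatenate([c_k, p_k], axis=1)
    logits = q @ k.T / np.sqrt(q.shape[1])               # equals content term + spatial term
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

# Toy sizes: 3 object queries, 5 image positions, 8 channels (all hypothetical).
rng = np.random.default_rng(0)
n_q, n_k, d = 3, 5, 8
f = rng.normal(size=(n_q, d))        # decoder embeddings
s = rng.normal(size=(n_q, 2))        # unnormalized reference points
W_t = rng.normal(size=(d, d))        # stand-in for the learned transformation T
c_q = rng.normal(size=(n_q, d))      # content queries
c_k = rng.normal(size=(n_k, d))      # content keys (encoder features)
p_k = sine_embed(rng.uniform(size=(n_k, 2)), d)  # positional keys

p_q = conditional_spatial_query(f, s, W_t)
weights = cross_attention_weights(c_q, p_q, c_k, p_k)
print(weights.shape)  # (3, 5)
```

Because the query and key are concatenations, the logit is exactly c_q·c_k + p_q·p_k, so the spatial term can localize a region without relying on the content term. In the paper T is a small feed-forward network that outputs a per-channel scaling vector; the dense `W_t` here is only a stand-in.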