Object detection without the hand-crafted pipeline
Facebook Research swaps anchors and NMS for a Transformer and bipartite matching, matching Faster R-CNN in 50 lines of PyTorch.

What it does DETR treats object detection as a direct set prediction problem. A Transformer encoder-decoder looks at the whole image, compares learned object queries against global context, and spits out final bounding boxes in parallel—no region proposals, no non-maximum suppression, no anchor grids to tune.
The interesting bit The trick is the loss function: bipartite matching between predictions and ground truth forces each query to claim exactly one object (or no-object). It is the same idea as machine translation, except the “sentence” is a fixed set of boxes and the “vocabulary” is COCO classes plus coordinates.
Key highlights
- Matches Faster R-CNN + ResNet-50 at 42 AP on COCO with half the FLOPs
- Inference demo runs in 50 lines of PyTorch; full training code is just
main.pyplus model definitions - Pretrained models load via
torch.hubin one line - Extends to panoptic segmentation by freezing the box model and training a mask head on top
- Includes Detectron2 wrapper in
d2/for refugees from that ecosystem
Caveats
- Training is not cheap: 6 days on 8× V100 for the full 300-epoch schedule (28 min/epoch)
- DC5 models are sensitive to batch size at evaluation; use 1 image per GPU or AP drops
- Segmentation requires a two-stage training recipe: boxes first, masks second
Verdict Worth studying if you want to understand how Transformers ate computer vision, or if you are tired of maintaining anchor-generation spaghetti. Skip if you need real-time edge inference or a plug-and-play training pipeline for custom datasets.