← all repositories
facebookresearch/detr

Object detection without the hand-crafted pipeline

Facebook Research swaps anchors and NMS for a Transformer and bipartite matching, matching Faster R-CNN in 50 lines of PyTorch.

15.3k stars Python Computer VisionML Frameworks
detr
Velocity · 7d
+6.9
★ / day
Trend
steady
star history

What it does DETR treats object detection as a direct set prediction problem. A Transformer encoder-decoder looks at the whole image, compares learned object queries against global context, and spits out final bounding boxes in parallel—no region proposals, no non-maximum suppression, no anchor grids to tune.

The interesting bit The trick is the loss function: bipartite matching between predictions and ground truth forces each query to claim exactly one object (or no-object). It is the same idea as machine translation, except the “sentence” is a fixed set of boxes and the “vocabulary” is COCO classes plus coordinates.

Key highlights

  • Matches Faster R-CNN + ResNet-50 at 42 AP on COCO with half the FLOPs
  • Inference demo runs in 50 lines of PyTorch; full training code is just main.py plus model definitions
  • Pretrained models load via torch.hub in one line
  • Extends to panoptic segmentation by freezing the box model and training a mask head on top
  • Includes Detectron2 wrapper in d2/ for refugees from that ecosystem

Caveats

  • Training is not cheap: 6 days on 8× V100 for the full 300-epoch schedule (28 min/epoch)
  • DC5 models are sensitive to batch size at evaluation; use 1 image per GPU or AP drops
  • Segmentation requires a two-stage training recipe: boxes first, masks second

Verdict Worth studying if you want to understand how Transformers ate computer vision, or if you are tired of maintaining anchor-generation spaghetti. Skip if you need real-time edge inference or a plug-and-play training pipeline for custom datasets.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.