Teaching detectors to gossip about objects
Microsoft Research's CVPR 2018 work replaces hand-tuned NMS with learned object-to-object attention.

What it does
This is the official MXNet implementation of Relation Networks for Object Detection, a Faster R-CNN variant that adds an attention-based “Relation Module” between detected objects. Rather than scoring each box in isolation, the model lets objects attend to each other, refining classification and localization jointly. It also experiments with a learned replacement for hard-coded NMS, trained end-to-end with the rest of the detector.
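For context, the hard-coded step in question is usually greedy NMS. The sketch below is plain Python and is not taken from the repository; the function names `iou` and `greedy_nms`, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    A box is dropped when it overlaps an already-kept box by more than
    the fixed threshold. This is the hand-set rule a learned duplicate
    remover aims to replace.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

The threshold is a fixed hyperparameter and each decision looks only at pairwise overlap, which is why a learned alternative is attractive.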
The interesting bit
The core insight is treating object detection as a set problem rather than a collection of independent predictions. The relation module uses scaled dot-product attention (the same mechanism that would later dominate NLP) to let each detected box gather context from all others, weighted by geometric and appearance features.
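The idea can be sketched roughly as follows. This is a heavily simplified, single-head illustration in plain Python, not the repository's MXNet code. It assumes identity projections in place of learned query, key, and value matrices, and takes the geometric weights as a precomputed matrix rather than learning them from box coordinates. The name `relation_attention` and its arguments are hypothetical.

```python
import math


def relation_attention(appearance, geo_weight):
    """Geometry-gated scaled dot-product attention over a set of objects.

    appearance: list of N feature vectors, each of length d.
    geo_weight: N x N matrix of non-negative geometric weights
                (assumed precomputed here; learned in the real model).
    Returns N vectors: each input feature plus its attended context.
    """
    n = len(appearance)
    d = len(appearance[0])
    outputs = []
    for i in range(n):
        # Appearance similarity between object i and every object j,
        # scaled by sqrt(d) as in scaled dot-product attention.
        logits = [
            sum(a * b for a, b in zip(appearance[i], appearance[j])) / math.sqrt(d)
            for j in range(n)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        # Attention weight is proportional to geo_weight * exp(logit);
        # clamp the geometric term so the normaliser is never zero.
        unnorm = [
            max(geo_weight[i][j], 1e-6) * math.exp(logits[j] - peak)
            for j in range(n)
        ]
        total = sum(unnorm)
        weights = [u / total for u in unnorm]
        # Weighted sum of the other objects' features.
        context = [
            sum(weights[j] * appearance[j][k] for j in range(n)) for k in range(d)
        ]
        # Residual connection: augment the original feature with context.
        outputs.append([a + c for a, c in zip(appearance[i], context)])
    return outputs


feats = [[1.0, 0.0], [1.0, 0.0]]
geo = [[1.0, 1.0], [1.0, 1.0]]
out = relation_attention(feats, geo)
```

Because every object attends over the whole set, the output for one box depends on all the others, which is what makes this a set-level prediction rather than N independent ones.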
Key highlights
- Achieves 35.2 mAP on COCO minival with standard Faster R-CNN (up from 31.8 baseline), and 38.6 mAP with FPN
- Includes a full “Learn NMS” variant that drops traditional non-maximum suppression entirely
- Built on Deformable ConvNets; requires MXNet 1.1.0 with deformable convolution operators
- Provides ten pre-configured experiment YAMLs covering Faster R-CNN, Deformable R-CNN, and FPN variants
- Pretrained models available via OneDrive and BaiduYun (with extraction passwords)
Caveats
- Python 2.7 only; Python 3 requires manual code modification
- Pinned to MXNet 1.1.0 (2018); the README warns that newer versions may break and recommends checking out that exact commit
- Model downloads require navigating OneDrive or BaiduYun with passwords, not a simple wget
- FPN experiments need pre-generated RPN proposals downloaded separately in two parts
Verdict
Worth studying if you’re researching attention mechanisms in vision or historical NMS alternatives. Skip it for production use—the stack is too dated, and modern detectors (DETR, DINO) have absorbed these lessons into cleaner architectures.