The 2016 paper that made object detection fully convolutional
R-FCN replaced per-region sub-networks with shared convolutions, cutting computation while keeping accuracy.

What it does
R-FCN is a region-based object detector that runs almost entirely as fully-convolutional layers over the whole image. Instead of the expensive per-region processing that Fast/Faster R-CNN applied hundreds of times per image, it shares computation across all candidate regions. The result: object detection with ResNet backbones at ~0.09–0.17 seconds per image on a Titan X, scoring 77.4% mAP (ResNet-50) or 79.5% mAP (ResNet-101) on PASCAL VOC 2007.
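The interesting bit
The trick is making a fully-convolutional classifier work for detection at all. Classification wants translation invariance; detection needs translation variance to localize objects. R-FCN resolves the tension with position-sensitive score maps: the final conv layer emits k²(C+1) channels for C classes plus background, one group per cell of a k×k grid laid over each region proposal (top-left, top-center, and so on). Position-sensitive RoI pooling then reads each grid cell only from its own group of maps and averages the k² cell scores into a per-class vote. That per-region step has no learned weights, so the shared backbone does all the heavy lifting.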
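To make the pooling step concrete, here is a minimal NumPy sketch of position-sensitive RoI pooling followed by average voting. It is an illustration, not the repo's Caffe layer: the function name `ps_roi_pool`, the channel layout (grid cell first, then class), and the end-exclusive box coordinates in feature-map units are all assumptions made for this example.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive RoI pooling with average voting (illustrative sketch).

    score_maps: array of shape (k*k*(C+1), H, W).
                Assumed layout: channel = (row*k + col)*(C+1) + class.
    roi:        (x0, y0, x1, y1) in feature-map coordinates, end-exclusive.
    Returns an array of C+1 per-class scores for this region.
    """
    c1 = num_classes + 1  # classes plus background
    n_ch, H, W = score_maps.shape
    assert n_ch == k * k * c1, "expected k*k*(C+1) score maps"
    maps = score_maps.reshape(k, k, c1, H, W)

    x0, y0, x1, y1 = roi
    bin_w = (x1 - x0) / k
    bin_h = (y1 - y0) / k

    pooled = np.zeros((k, k, c1))
    for i in range(k):        # grid row
        for j in range(k):    # grid column
            # Spatial extent of grid cell (i, j), clipped to the feature map
            ys = min(max(int(np.floor(y0 + i * bin_h)), 0), H - 1)
            ye = min(max(int(np.ceil(y0 + (i + 1) * bin_h)), ys + 1), H)
            xs = min(max(int(np.floor(x0 + j * bin_w)), 0), W - 1)
            xe = min(max(int(np.ceil(x0 + (j + 1) * bin_w)), xs + 1), W)
            # Cell (i, j) reads ONLY from its own group of score maps
            pooled[i, j] = maps[i, j, :, ys:ye, xs:xe].mean(axis=(1, 2))

    # Voting: average the k*k cell scores for each class
    return pooled.mean(axis=(0, 1))

if __name__ == "__main__":
    k, C = 3, 2
    maps = np.zeros((k * k * (C + 1), 9, 9))
    # The "top-left cell, class 1" map fires only in the top-left corner
    maps[1, 0:3, 0:3] = 9.0
    print(ps_roi_pool(maps, (0, 0, 9, 9), k, C))  # box aligned with the response
    print(ps_roi_pool(maps, (3, 3, 9, 9), k, C))  # same map, shifted box
```

In the toy example the aligned box scores 1.0 for class 1 while the shifted box scores 0.0, even though both read the same score maps. That sensitivity to where the box sits is the translation variance detection needs, and it comes from the pooling pattern alone.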
Key highlights
- Official implementation of the NIPS 2016 paper by Dai, Li, He, and Sun (He and Sun are also ResNet co-authors)
- Ships with a custom Caffe branch and pre-trained ResNet-50/101 weights
- Supports both selective-search and RPN proposals; includes OHEM training scripts
- Training runs ~13–19 hours on a Titan X depending on backbone
- MIT licensed
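The OHEM (online hard example mining) training listed in the highlights can be sketched in a few lines. As the R-FCN paper describes it, the loss is evaluated for every proposal in the forward pass, and only the B proposals with the highest loss are backpropagated. The sketch below is a NumPy illustration of that selection step only; `select_hard_examples` and its arguments are hypothetical names, not functions from the repo.

```python
import numpy as np

def select_hard_examples(per_roi_loss, batch_size):
    """Return indices of the `batch_size` RoIs with the highest loss.

    per_roi_loss: 1-D sequence with one loss value per proposal.
    batch_size:   number of RoIs to keep for the backward pass (B).
    """
    losses = np.asarray(per_roi_loss, dtype=float)
    b = min(batch_size, losses.size)
    # Sort by descending loss; a stable sort keeps ties in proposal order
    order = np.argsort(-losses, kind="stable")
    return order[:b]

def ohem_loss(per_roi_loss, batch_size):
    """Mean loss over the selected hard examples only."""
    losses = np.asarray(per_roi_loss, dtype=float)
    keep = select_hard_examples(losses, batch_size)
    return losses[keep].mean()
```

Because per-region computation in R-FCN is so cheap, scoring every proposal before picking the hard ones adds little to training time, which is why the technique pairs well with this architecture.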
Caveats
- The authors themselves recommend using the newer Deformable R-FCN in MXNet instead, calling it “significantly” more accurate with “very low extra computational overhead”
- Requires MATLAB 2014a+ and specific NVIDIA GPUs (Titan, K40, K80); Windows users get pre-built Caffe mex files, while Linux users must compile their own
- This repo is essentially frozen research code; a Python reimplementation exists but lives elsewhere
Verdict
Worth studying if you care about the design lineage of region-based detectors from R-CNN onward, or if you need to reproduce a 2016 baseline exactly. Skip it if you want something maintained, production-ready, or Python-native.