peteanderson80/bottom-up-attention
A bottom-up attention model based on Faster R-CNN with ResNet-101 that extracts salient image region features for visual question answering and image captioning.

This repository provides code for training a bottom-up attention model using a multi-GPU Faster R-CNN with a ResNet-101 backbone, trained on Visual Genome object and attribute annotations. The pretrained model generates features for salient image regions, and these can replace traditional CNN features in attention-based image captioning and VQA systems. The approach achieved state-of-the-art performance on MSCOCO captioning (CIDEr 117.9, BLEU_4 36.9) and won the 2017 VQA Challenge with 70.3% overall accuracy.
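
To illustrate how per-region features might feed an attention-based captioning or VQA model, here is a minimal sketch of soft attention over a set of region vectors. It is not code from this repository: the function name `soft_attention`, the dot-product scoring, the region count of 36, and the toy feature dimension of 8 are all assumptions chosen for illustration, and the random vectors stand in for features the pretrained model would extract.

```python
import math
import random


def soft_attention(region_features, query):
    """Return softmax attention weights over regions and the weighted feature sum.

    Scoring is a plain dot product with the query (an illustrative assumption;
    real captioning/VQA models typically learn the scoring function).
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    max_score = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    dim = len(query)
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended


# Hypothetical stand-in data: random vectors instead of real region features.
random.seed(0)
NUM_REGIONS, FEATURE_DIM = 36, 8  # toy sizes, assumed for this sketch
region_features = [
    [random.gauss(0, 1) for _ in range(FEATURE_DIM)] for _ in range(NUM_REGIONS)
]
query = [random.gauss(0, 1) for _ in range(FEATURE_DIM)]  # e.g. a decoder or question state

weights, attended = soft_attention(region_features, query)
```

The weights form a probability distribution over regions, so `attended` is a convex combination of the region vectors. A downstream decoder or answer classifier would consume that single attended vector in place of a pooled whole-image CNN feature.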