A thousand stars for teaching neural nets to caption images
This repo implements self-critical reinforcement learning for image captioning, plus the kitchen sink of training tricks.

What it does
A PyTorch research codebase for training image captioning models on COCO or Flickr30k. It covers the full pipeline: data prep, cross-entropy pretraining, self-critical RL fine-tuning, and evaluation with standard metrics (BLEU, METEOR, CIDEr). There’s also a simple HTML visualizer for browsing results in a browser.
The interesting bit
Self-critical sequence training is the headline feature: after 30 epochs of standard cross-entropy training, switching to RL with CIDEr as the reward reportedly pushes the CIDEr score to about 1.05. The author has also added DistributedDataParallel training via pytorch-lightning and a Transformer captioning model, making this more of a living toolkit than a one-paper reproduction.
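For readers new to the idea, here is a minimal sketch of the self-critical objective for a single caption, written in plain Python rather than PyTorch so it runs anywhere. It is not the repo's code: the function name, arguments, and numbers are all hypothetical, and details such as per-token masking, length normalization, and batching are left out. The one idea it illustrates is that the reward of the model's own greedy caption serves as the baseline for the sampled caption.

```python
import math


def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE loss for one sampled caption (hypothetical helper).

    sample_logprobs: log-probabilities the model gave each sampled token.
    sample_reward:   metric score (e.g. CIDEr) of the sampled caption.
    greedy_reward:   metric score of the greedy-decoded caption (the baseline).
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the likelihood of samples that beat the greedy
    # caption and lowers the likelihood of samples that score worse.
    return -advantage * sum(sample_logprobs)


# Made-up numbers: a two-token sample that scores 0.2 above the greedy caption.
sample_logprobs = [math.log(0.5), math.log(0.25)]
loss = self_critical_loss(sample_logprobs, sample_reward=1.2, greedy_reward=1.0)

# A sample that only ties the greedy caption contributes no gradient signal.
tie_loss = self_critical_loss(sample_logprobs, sample_reward=1.0, greedy_reward=1.0)
```

In a real training loop the log-probabilities would be tensors attached to the computation graph, while both rewards would be treated as constants.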
Key highlights
- Supports self-critical RL, bottom-up attention features, test-time ensembling, and Transformer architectures
- YAML configs plus command-line overrides for training; TensorBoard logging built in
- Pretrained models available; evaluation works on raw image folders or standard splits
- Colab demo notebook provided for quick experimentation
- Can be installed as an editable pip package if the raw scripts misbehave
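The "YAML configs plus command-line overrides" pattern from the highlights can be sketched as follows. This is only an illustration of the general idea, not the repo's actual option handling: the flag names `--learning_rate` and `--batch_size` are assumptions, and a plain dict stands in for the parsed YAML file so the snippet runs without a YAML library.

```python
import argparse


def merge_config(file_cfg, argv):
    """Return file_cfg with any explicitly passed CLI flags layered on top."""
    parser = argparse.ArgumentParser()
    # Hypothetical flags; defaults of None mean "not given on the command line".
    parser.add_argument("--learning_rate", type=float, default=None)
    parser.add_argument("--batch_size", type=int, default=None)
    args = parser.parse_args(argv)

    merged = dict(file_cfg)  # copy so the file-level config is left untouched
    for key, value in vars(args).items():
        if value is not None:
            merged[key] = value
    return merged


# Stand-in for values loaded from a YAML config file.
file_cfg = {"learning_rate": 5e-4, "batch_size": 10}

# Only batch_size is overridden; learning_rate keeps its file value.
cfg = merge_config(file_cfg, ["--batch_size", "32"])
```

Keeping the file as the baseline and the command line as a thin override layer makes one-off experiments easy to run without editing the config file.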
Caveats
- No CPU support at all; the author notes “there’s no point using cpus to train”, and CPU inference is available only by special request
- Raw-image evaluation explicitly doesn’t work for bottom-up feature models
- No live demo is implemented; the author's note reads “welcome pull request”
Verdict
Worth a look if you’re doing image captioning research and want a battle-tested PyTorch base with RL training already wired up. Skip it if you need a production API or CPU inference: this is a training rig, not a product.