A no-nonsense image captioning workhorse
A PyTorch research codebase that bundles modern captioning techniques without the framework bloat.

What it does
This is a training and evaluation toolkit for image captioning on COCO and Flickr30k. It covers the full pipeline: data preprocessing, training with cross-entropy or self-critical reinforcement learning, and inference with greedy decoding, sampling, or beam search. A Colab demo notebook and pretrained model zoo are provided for quick evaluation.
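To make the difference between the decoding modes concrete, here is a minimal sketch of beam search over a toy next-token model. It is not the repo's implementation: `beam_search`, `toy_step`, and the probability table are hypothetical names and values chosen for illustration. A beam size of 1 reduces to greedy decoding.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: maps a token prefix to next-token probabilities.
StepFn = Callable[[List[str]], Dict[str, float]]


def beam_search(step: StepFn, beam_size: int, max_len: int, eos: str = "<eos>") -> List[str]:
    """Return the highest-scoring token sequence found with a fixed-width beam."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]  # (tokens, cumulative log-prob)
    finished: List[Tuple[List[str], float]] = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, prob in step(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + [tok], score + math.log(prob)))
        # Keep only the best `beam_size` expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if max_len was hit
    return max(finished, key=lambda c: c[1])[0]


def toy_step(prefix: List[str]) -> Dict[str, float]:
    """Toy model (illustrative numbers only) where the greedy first choice is a trap."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.5, "cat": 0.5},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})


greedy_caption = beam_search(toy_step, beam_size=1, max_len=3)
beam_caption = beam_search(toy_step, beam_size=2, max_len=3)
```

In this toy setup the greedy pass commits to "a" because it is the most likely first token, while a beam of width 2 keeps "the" alive long enough to find the more probable full sequence.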
The interesting bit
The author has kept this repo in sync with a separate self-critical training implementation, so self-critical sequence training (SCST), the RL refinement method from the 2017 Rennie et al. paper, is a first-class feature rather than an afterthought. It also supports bottom-up attention features and a Transformer captioning model, which puts it a step ahead of older neuraltalk2 descendants.
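The core idea of self-critical training is compact enough to sketch: the reward of the model's own greedy caption serves as the baseline for a sampled caption. The snippet below is a plain-Python illustration of that per-caption loss, not code from the repo; `scst_loss` and its arguments are hypothetical names, and the rewards stand in for a metric such as CIDEr.

```python
from typing import Sequence


def scst_loss(
    sample_logprobs: Sequence[float],
    sample_reward: float,
    greedy_reward: float,
) -> float:
    """Self-critical policy-gradient loss for one sampled caption.

    sample_logprobs: log-probability of each token in the sampled caption.
    sample_reward:   score (e.g. CIDEr) of the sampled caption.
    greedy_reward:   score of the greedy-decoded caption, used as the baseline.
    """
    # Positive advantage: the sample beat the greedy caption, so minimizing
    # the loss raises the sample's log-probability; negative advantage lowers it.
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)


loss_better_sample = scst_loss([-0.5, -0.25], sample_reward=1.0, greedy_reward=0.5)
loss_worse_sample = scst_loss([-0.5, -0.25], sample_reward=0.5, greedy_reward=1.0)
```

In an actual training loop the rewards are treated as constants and gradients flow only through the log-probabilities; the sketch shows only the arithmetic.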
Key highlights
- Self-critical sequence training with CIDEr score caching for RL-based fine-tuning
- Bottom-up attention features and Transformer models supported alongside older architectures
- Distributed multi-GPU training via PyTorch Lightning (see ADVANCED.md)
- YAML config files plus command-line overrides for experiment management
- TensorBoard logging and HTML visualization interface for generated captions
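The config-plus-overrides pattern mentioned in the list can be illustrated with the standard library alone. This is a generic sketch, not the repo's actual option handling: `parse_with_config` is a hypothetical helper, the option names are made up, and the dict stands in for values already loaded from a YAML file.

```python
import argparse
from typing import Any, Dict, List


def parse_with_config(config_defaults: Dict[str, Any], argv: List[str]) -> Dict[str, Any]:
    """Use config values as defaults and let command-line flags override them.

    Only int, float, and str values are handled here; booleans would need
    dedicated parsing because bool("False") is True.
    """
    parser = argparse.ArgumentParser()
    for key, value in config_defaults.items():
        parser.add_argument(f"--{key}", type=type(value), default=value)
    return vars(parser.parse_args(argv))


# Stand-in for the contents of a YAML config file (illustrative values).
file_config = {"learning_rate": 0.0005, "batch_size": 10}

# Equivalent to passing `--batch_size 32` on the command line.
options = parse_with_config(file_config, ["--batch_size", "32"])
```

The file supplies a reproducible baseline for an experiment, while one-off changes stay visible in the command that launched the run.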
Caveats
- GPU-only: no CPU training or inference option currently exists, and the author notes there’s “no point using cpus to train”
- Raw image evaluation doesn’t work for bottom-up feature models; you need precomputed features
- No live demo is implemented; the author's note on this reads “welcome pull request”
Verdict
Worth a look for researchers or practitioners who need a solid, technique-current captioning baseline and don’t mind some manual data preparation. Skip it if you want a polished API or CPU/mobile deployment.