ruotianluo/ImageCaptioning.pytorch

A no-nonsense image captioning workhorse

A PyTorch research codebase that bundles modern captioning techniques without the framework bloat.

1.5k stars · Python · Computer Vision · Language Models
Velocity (7d): +0.4 ★ / day · Trend: steady

What it does

This is a training and evaluation toolkit for image captioning on COCO and Flickr30k. It covers the full pipeline: data preprocessing, training with cross-entropy or self-critical reinforcement learning, and inference with greedy decoding, sampling, or beam search. A Colab demo notebook and pretrained model zoo are provided for quick evaluation.
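To make the three inference modes concrete, here is a minimal sketch of greedy decoding, sampling, and beam search over a toy next-token table. Everything in it (`VOCAB`, `next_token_probs`, the probabilities, the function names) is a hypothetical stand-in chosen for illustration and is not taken from the repository, whose decoders run on a trained captioning model.

```python
import math
import random

VOCAB = ["a", "dog", "cat", "runs", "<eos>"]

def next_token_probs(prefix):
    """Toy stand-in for one decoder step (hypothetical, not the repo's model)."""
    table = {
        (): [0.9, 0.05, 0.05, 0.0, 0.0],
        ("a",): [0.0, 0.6, 0.4, 0.0, 0.0],
        ("a", "dog"): [0.0, 0.0, 0.0, 0.52, 0.48],
        ("a", "cat"): [0.0, 0.0, 0.0, 0.2, 0.8],
    }
    # Any prefix missing from the table ends the caption.
    return table.get(tuple(prefix), [0.0, 0.0, 0.0, 0.0, 1.0])

def greedy_decode(max_len=5):
    """Pick the single most likely token at every step."""
    tokens = []
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        tok = VOCAB[probs.index(max(probs))]
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

def sample_decode(rng, max_len=5):
    """Draw each token at random in proportion to its probability."""
    tokens = []
    for _ in range(max_len):
        tok = rng.choices(VOCAB, weights=next_token_probs(tokens))[0]
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

def beam_search(beam_size=2, max_len=5):
    """Keep the beam_size best partial captions by cumulative log-probability."""
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in zip(VOCAB, next_token_probs(tokens)):
                if p > 0.0:
                    candidates.append((tokens + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == "<eos>":
                finished.append((tokens[:-1], score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]

greedy_caption = greedy_decode()
beam_caption = beam_search(beam_size=2)
sampled_caption = sample_decode(random.Random(0))
```

In this toy table the greedy caption commits to "dog" early and ends up with a lower overall probability than the caption beam search finds, which is the usual argument for paying the extra cost of a beam.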

The interesting bit

The author has kept this repo in sync with a separate self-critical training implementation, so self-critical sequence training, the RL fine-tuning method from Rennie et al. (2017), is a first-class feature rather than an afterthought. It also supports bottom-up attention features and a Transformer captioning model, which puts it a step ahead of older neuraltalk2 descendants.
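The core idea of self-critical training can be sketched in a few lines: a sampled caption is rewarded relative to the reward of the model's own greedy caption, and that difference scales the log-probability of the sample. The sketch below is a plain-Python illustration under stated assumptions, not the repository's code: `toy_reward` is a hypothetical word-overlap score standing in for CIDEr, and the log-probabilities are hard-coded floats where a real implementation would use tensors from the model.

```python
import math

def toy_reward(caption, reference):
    """Hypothetical stand-in for CIDEr: fraction of reference words covered."""
    ref_words = set(reference)
    return len(set(caption) & ref_words) / len(ref_words)

def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE-style loss with the greedy caption's reward as the baseline."""
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the likelihood of samples that beat the baseline
    # and lowers the likelihood of samples that fall short of it.
    return -advantage * sum(sample_logprobs)

reference = ["a", "dog", "runs"]
sampled = ["a", "dog", "runs"]   # caption drawn by sampling
greedy = ["a", "cat"]            # caption from greedy decoding (the baseline)

# Assumed per-token log-probabilities of the sampled caption.
sample_logprobs = [math.log(0.9), math.log(0.6), math.log(0.52)]

loss = self_critical_loss(
    sample_logprobs,
    toy_reward(sampled, reference),
    toy_reward(greedy, reference),
)
```

Because the baseline comes from the model's own greedy output, no separate value network is needed; a sample only receives a positive learning signal when it scores higher than what the model would have produced at test time.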

Key highlights

  • Self-critical sequence training with CIDEr score caching for RL-based fine-tuning
  • Bottom-up attention features and Transformer models supported alongside older architectures
  • Distributed multi-GPU training via PyTorch Lightning (see ADVANCED.md)
  • YAML config files plus command-line overrides for experiment management
  • TensorBoard logging and HTML visualization interface for generated captions
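As an illustration of the config-plus-overrides pattern in the highlights above, here is a minimal sketch in which values given on the command line take precedence over values from a config file. The option names (`--batch_size`, `--learning_rate`, `--beam_size`) and the helper `merge_config` are hypothetical and are not the repository's actual flags; the dictionary stands in for what a YAML parser would return after loading a file.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    # default=None distinguishes "not given on the command line" from a real value.
    parser.add_argument("--batch_size", type=int, default=None)
    parser.add_argument("--learning_rate", type=float, default=None)
    parser.add_argument("--beam_size", type=int, default=None)
    return parser

def merge_config(file_config, argv):
    """Start from the file's values, then apply any command-line overrides."""
    args = vars(build_parser().parse_args(argv))
    merged = dict(file_config)
    merged.update({k: v for k, v in args.items() if v is not None})
    return merged

# Stand-in for the parsed contents of a YAML experiment file.
file_config = {"batch_size": 10, "learning_rate": 5e-4, "beam_size": 1}

# Only beam_size is overridden; the other values come from the file.
config = merge_config(file_config, ["--beam_size", "5"])
```

Keeping the shared settings in a file and passing only the differences on the command line makes each experiment's deviation from the baseline easy to see in the shell history.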

Caveats

  • GPU-only: no CPU training or inference option currently exists, and the author notes there’s “no point using cpus to train”
  • Raw image evaluation doesn’t work for bottom-up feature models; you need precomputed features
  • No live demo is implemented; the author invites contributions (“welcome pull request”)

Verdict

Worth a look for researchers or practitioners who need a solid, technique-current captioning baseline and don’t mind some manual data preparation. Skip it if you want a polished API or CPU/mobile deployment.
