aimagelab/meshed-memory-transformer
A transformer architecture with memory augmentation for generating textual descriptions from images, published at CVPR 2020.

The Meshed-Memory Transformer (M2) is a deep learning model that generates captions for images using a modified transformer architecture. Its encoder extends self-attention with learned memory vectors, which lets the model capture prior knowledge about relationships between image regions, and its decoder attends to the outputs of every encoder layer through a mesh-like connectivity rather than to the last layer only. The model operates on pre-extracted object-detection features from a vision backbone and produces natural language descriptions through attention-based decoding. It was trained and evaluated on the COCO dataset.
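
To make the memory idea concrete, here is a minimal sketch of attention whose keys and values are extended with extra learned slots. It is not the repository's implementation: the function name `memory_augmented_attention`, the slot counts, the random initialisation, and the use of identity projections are all assumptions chosen for illustration, and the code uses plain Python lists rather than a tensor library.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_augmented_attention(queries, keys, values, mem_keys, mem_values):
    """Scaled dot-product attention with memory slots appended to keys/values.

    queries, keys, values: one d-dimensional vector per image region.
    mem_keys, mem_values: hypothetical learned slots that do not depend
    on the input image (assumption for this sketch).
    """
    d = len(queries[0])
    all_keys = keys + mem_keys        # region keys followed by memory keys
    all_values = values + mem_values  # region values followed by memory values
    outputs = []
    for q in queries:
        # Similarity of this query to every region key and every memory key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in all_keys
        ]
        weights = softmax(scores)
        # Weighted sum over region values and memory values together.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, all_values))
            for j in range(d)
        ])
    return outputs


random.seed(0)
d, n_regions, n_slots = 4, 3, 2
regions = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_regions)]
mem_k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_slots)]
mem_v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_slots)]

# Identity projections are used for brevity; a real layer would learn
# separate query, key, and value projections.
attended = memory_augmented_attention(regions, regions, regions, mem_k, mem_v)
```

Because the memory slots sit in the same softmax as the region keys, each output vector can mix in information that is not present in the current image. Passing empty lists for `mem_keys` and `mem_values` reduces the function to ordinary self-attention.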