lucidrains/CoCa-pytorch
A PyTorch implementation of CoCa, a multimodal image-text foundation model combining contrastive and generative training objectives.

This repository implements the CoCa (Contrastive Captioner) architecture, which jointly trains an image encoder and a text decoder with two objectives: a contrastive loss and a captioning loss. A vision transformer encodes the image and a PaLM-style transformer decoder generates the text, so the model can be used for both image-text understanding and captioning. Cross-attention layers in the decoder attend to the image tokens for multimodal fusion. The CoCa paper reports state-of-the-art accuracy on ImageNet for this architecture.
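To make the joint objective concrete, here is a minimal, dependency-free sketch of how a contrastive loss and a captioning loss can be combined into a single training loss. It is an illustration under stated assumptions, not the repository's code: the function names, the `temperature` default, and the `caption_weight` / `contrastive_weight` parameters are hypothetical, and plain Python lists stand in for tensors.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over a list of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def l2_normalize(v):
    # Scale a vector to unit length (left unchanged if it is all zeros).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Symmetric InfoNCE-style loss: pair i of images and texts is the positive,
    # every other pairing in the batch is a negative. (Assumed formulation.)
    imgs = [l2_normalize(v) for v in image_embeds]
    txts = [l2_normalize(v) for v in text_embeds]
    n = len(imgs)
    # sim[i][j] = cosine similarity of text i and image j, divided by temperature
    sim = [[sum(a * b for a, b in zip(t, im)) / temperature for im in imgs]
           for t in txts]
    text_to_image = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]
    image_to_text = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (text_to_image + image_to_text)

def caption_loss(logits, targets):
    # Mean cross-entropy over caption positions; logits[t] holds the
    # vocabulary scores used to predict the token targets[t].
    per_token = [log_softmax(row)[tok] for row, tok in zip(logits, targets)]
    return -sum(per_token) / len(targets)

def coca_loss(image_embeds, text_embeds, logits, targets,
              caption_weight=1.0, contrastive_weight=1.0):
    # Weighted sum of the two objectives (weights are illustrative).
    return (caption_weight * caption_loss(logits, targets)
            + contrastive_weight * contrastive_loss(image_embeds, text_embeds))
```

In a real training loop the embeddings and logits would come from the image encoder and text decoder, and the two weights would be tuned; the sketch only shows how the two losses fit into one scalar.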