
lucidrains/CoCa-pytorch

A PyTorch implementation of CoCa, a multimodal image-text foundation model combining contrastive and generative training objectives.

Velocity (7d): +0.8 ★ / day · Trend: steady

This repository implements the CoCa (Contrastive Captioner) architecture, which jointly trains an image encoder and a text decoder with two objectives: a contrastive loss that aligns image and text embeddings, and a captioning loss that trains the decoder to predict the caption token by token. A vision transformer encodes the image and a PaLM-style decoder generates text, with cross-attention layers fusing image features into the text stream. The resulting model supports both image-text understanding and captioning. The original CoCa paper reported state-of-the-art accuracy on ImageNet; that result refers to the paper's model rather than to a benchmark of this implementation.
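To make the joint objective concrete, here is a minimal sketch of how a contrastive loss and a captioning loss might be combined into one training loss. It is written in plain Python rather than PyTorch so the arithmetic stays visible, and it is not the repository's API: the function names, the `temperature` parameter, and the two loss weights (both defaulting to 1.0 here) are assumptions made for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of targets[i] under softmax(logits[i])."""
    total = sum(log_softmax(row)[t] for row, t in zip(logits, targets))
    return -total / len(targets)


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(image_embeds, text_embeds, temperature=1.0):
    """Symmetric contrastive loss: the i-th image should match the i-th text."""
    imgs = [l2_normalize(v) for v in image_embeds]
    txts = [l2_normalize(v) for v in text_embeds]
    # sim[i][j] = cosine similarity of image i and text j, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    labels = list(range(len(sim)))  # matching pairs lie on the diagonal
    return 0.5 * (cross_entropy(sim, labels) + cross_entropy(sim_t, labels))


def coca_loss(image_embeds, text_embeds, caption_logits, caption_targets,
              contrastive_weight=1.0, caption_weight=1.0, temperature=1.0):
    """Weighted sum of the contrastive and captioning losses (hypothetical weights)."""
    con = contrastive_loss(image_embeds, text_embeds, temperature)
    cap = cross_entropy(caption_logits, caption_targets)  # next-token prediction
    return contrastive_weight * con + caption_weight * cap


# Toy batch: two image/text pairs and one caption token over a 4-word vocabulary.
embeds = [[1.0, 0.0], [0.0, 1.0]]
loss = coca_loss(embeds, embeds, caption_logits=[[0.0, 0.0, 0.0, 0.0]],
                 caption_targets=[2])
```

In a real training loop the embeddings and caption logits would come from the image encoder and text decoder, and both terms would be backpropagated together. The weighted sum is what lets a single forward pass train for retrieval-style alignment and caption generation at the same time.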
