lucidrains/x-clip
A PyTorch implementation of CLIP, a multi-modal model that learns to associate images with text using contrastive learning.

The repository provides a complete implementation of OpenAI's CLIP, extended with experimental improvements from recent research papers. Optional features include fine-grained contrastive learning (FILIP), decoupled contrastive learning (DCL), extra latent projections (CLOOB), visual self-supervised learning, and masked language modeling (MLM) on text. The text and image encoders are configurable: depth and number of attention heads can be set for each, and the patch size can be set for the image encoder.
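To make the contrastive objective concrete, here is a minimal sketch of a symmetric CLIP-style loss in plain PyTorch. It is not taken from the repository: the function name `clip_contrastive_loss`, the fixed `temperature` of 0.07, and the `decoupled` flag are assumptions chosen for illustration. The `decoupled` branch shows the basic idea behind DCL, which removes the positive pair from the denominator of the loss; it assumes a batch size of at least 2.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(text_latents, image_latents, temperature=0.07, decoupled=False):
    """Symmetric contrastive loss over a batch of paired text/image latents.

    Hypothetical helper for illustration only, not the repository's API.
    text_latents, image_latents: (batch, dim) tensors; row i of each is a matching pair.
    decoupled=True drops the positive pair from the denominator (the DCL idea).
    """
    # L2-normalize so that the dot product is a cosine similarity
    text_latents = F.normalize(text_latents, dim=-1)
    image_latents = F.normalize(image_latents, dim=-1)

    # (batch, batch) similarity matrix; entry [i, j] compares text i with image j
    sim = text_latents @ image_latents.t() / temperature

    def one_direction(logits):
        # matching pairs sit on the diagonal
        positives = logits.diagonal()
        if decoupled:
            # mask the positives out of the log-sum-exp denominator
            eye = torch.eye(logits.shape[0], dtype=torch.bool, device=logits.device)
            logits = logits.masked_fill(eye, float("-inf"))
        # -log( exp(positive) / sum_j exp(logit_j) ), averaged over the batch
        return (torch.logsumexp(logits, dim=-1) - positives).mean()

    # average the text-to-image and image-to-text directions
    return 0.5 * (one_direction(sim) + one_direction(sim.t()))


if __name__ == "__main__":
    torch.manual_seed(0)
    text = torch.randn(4, 512, requires_grad=True)   # mock text latents
    image = torch.randn(4, 512, requires_grad=True)  # mock image latents
    loss = clip_contrastive_loss(text, image)
    loss.backward()
    print(loss.item())
```

With `decoupled=False` this reduces to the average of two cross-entropy losses whose targets are the diagonal indices, which is the standard CLIP formulation. In practice the temperature is usually a learned parameter rather than a constant.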