openai/CLIP
OpenAI's CLIP is a multimodal neural network trained on image-text pairs that performs zero-shot image classification given natural language queries.

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a wide variety of image-text pairs. Given an image, it predicts the most relevant text snippet from a set of candidates, without task-specific fine-tuning. Because the model learns visual concepts from natural language descriptions, it transfers zero-shot to downstream tasks. On ImageNet, its zero-shot accuracy matches that of the original ResNet50, without using any of the original labeled training examples.
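
The zero-shot step described above can be sketched without the model itself: each candidate label is written as a text prompt, the image and the prompts are embedded into a shared space, and the prompt whose embedding is most similar to the image embedding wins. The sketch below is only an illustration under stated assumptions. The three-dimensional embeddings are hand-made toy values standing in for real encoder outputs, `zero_shot_probs` is a hypothetical helper name, and the scale factor of 100 is an assumed temperature rather than a value read from a trained checkpoint.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts.

    logit_scale is an assumed temperature, not a value taken from a real model.
    """
    img = normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, normalize(t)))
        for t in text_embs
    ]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (hypothetical): a real pipeline would obtain these from
# the model's text encoder and image encoder.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.2, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)  # prints "a photo of a dog"
```

No classifier is trained here: changing the task only means changing the list of prompts, which is what makes the transfer zero-shot.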