OFA-Sys/OFA
OFA is a unified sequence-to-sequence pretrained model that bridges vision, language, and cross-modal tasks including image captioning, VQA, and text-to-image generation.

OFA (ICML 2022) is a multimodal foundation model that unifies its tasks in a single sequence-to-sequence learning framework. It supports English and Chinese and handles tasks including image captioning (ranked 1st on the MSCOCO leaderboard), visual question answering, visual grounding, text-to-image synthesis, and text and image classification. The repository provides pretrained checkpoints and step-by-step pretraining and finetuning instructions, covering both standard finetuning and prompt tuning.
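To make the "unified sequence-to-sequence" idea concrete, the sketch below shows one way different tasks could be cast into the same (instruction, optional image, target text) shape. It is an illustration only, not the repository's code: `Seq2SeqExample`, `build_example`, and the instruction templates are hypothetical names and wordings chosen for this example, and the actual prompts and data format used by OFA may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Seq2SeqExample:
    """Hypothetical container: every task becomes text in, text out."""
    instruction: str           # text fed to the encoder
    image_path: Optional[str]  # optional visual input; None for text-only tasks
    target: str                # text the decoder is trained to produce


# Illustrative templates (assumed wording, not taken from the repository).
TEMPLATES = {
    "caption": "what does the image describe?",
    "vqa": "{text}",
    "grounding": 'which region does the text "{text}" describe?',
    "text_classification": 'what is the sentiment of the text "{text}"?',
}


def build_example(task: str, *, image_path: Optional[str] = None,
                  text: Optional[str] = None, target: str = "") -> Seq2SeqExample:
    """Cast one task instance into the shared (instruction, image, target) form."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    instruction = TEMPLATES[task].format(text=text or "")
    return Seq2SeqExample(instruction, image_path, target)


if __name__ == "__main__":
    examples = [
        build_example("caption", image_path="dog.jpg", target="a dog on a beach"),
        build_example("vqa", image_path="cat.jpg",
                      text="what color is the cat?", target="black"),
        build_example("text_classification", text="great movie", target="positive"),
    ]
    for ex in examples:
        print(ex)
```

Because every task reduces to the same input and output types, a single encoder-decoder model and a single training loss can, in principle, serve all of them; only the instruction text and the presence of an image change.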