facebookresearch/mmf
A modular PyTorch framework for vision-and-language multimodal research and model development from Facebook AI Research.

MMF (Multimodal Framework) provides reference implementations of state-of-the-art vision-and-language models for tasks including visual question answering, image captioning, and multimodal dialog. Built on PyTorch with support for distributed training, it serves as both a research framework and a starter codebase for challenges such as Hateful Memes, TextVQA, TextCaps, and VQA. The framework is designed to be unopinionated, scalable, and fast, making it a practical base for bootstrapping new multimodal research projects.