open-mmlab/Multimodal-GPT
Multi-modal chatbot that processes visual and language instructions, based on the OpenFlamingo vision-language model.

Velocity (7d): +1.3 ★/day · Trend: steady
This repository trains a multi-modal chatbot by fine-tuning the OpenFlamingo architecture on visual and language instruction datasets. It builds instruction-style training data from VQA, image captioning, visual reasoning, text OCR, and visual dialogue tasks. Visual and language instructions are trained jointly, with the aim of improving the model's multi-modal instruction following.
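The joint-training idea can be illustrated with a minimal sketch. It is not the repository's code: the `Sample` class, the `format_prompt` template, the `<image>` placeholder, and `joint_batches` are all hypothetical names chosen for illustration, and the mixing strategy (shuffle both pools together) is an assumption. It shows only how image-grounded and language-only instruction samples might share one prompt format and be drawn into the same batches.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Sample:
    """One instruction-following example (hypothetical schema)."""
    instruction: str
    response: str
    image_path: Optional[str] = None  # None marks a language-only sample


def format_prompt(sample: Sample) -> str:
    """Render a sample with one shared template (assumed, not the project's)."""
    # Visual samples get an image placeholder; language-only samples do not.
    image_slot = "<image>\n" if sample.image_path is not None else ""
    return (
        f"{image_slot}### Instruction:\n{sample.instruction}\n"
        f"### Response:\n{sample.response}"
    )


def joint_batches(
    visual: List[Sample],
    language: List[Sample],
    batch_size: int,
    seed: int = 0,
) -> Iterator[List[Sample]]:
    """Yield batches drawn from the shuffled union of both sample pools."""
    rng = random.Random(seed)
    mixed = list(visual) + list(language)
    rng.shuffle(mixed)
    for start in range(0, len(mixed), batch_size):
        yield mixed[start:start + batch_size]


visual_pool = [
    Sample("What animal is shown?", "A dog.", image_path="dog.jpg"),
    Sample("Describe the image.", "A red bus on a street.", image_path="bus.jpg"),
]
language_pool = [
    Sample("Give a synonym for 'fast'.", "Quick."),
    Sample("Translate 'hello' to French.", "Bonjour."),
]

for batch in joint_batches(visual_pool, language_pool, batch_size=2):
    for sample in batch:
        print(format_prompt(sample))
        print("---")
```

Shuffling the two pools together is the simplest way to mix them. A real training setup might instead sample each pool at a fixed ratio.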