amazon-science/mm-cot
A multimodal language model framework that performs chain-of-thought reasoning over combined vision and text features for science question answering.

Velocity (7d): +3.3 ★/day
Trend: steady
Multimodal-CoT trains language models in two stages: the first generates a rationale, and the second infers the answer. Vision features extracted by ViT, CLIP, ResNet, and DETR encoders are incorporated into this decoupled training architecture. The framework is evaluated on the ScienceQA dataset, where it is used to show that visual information can improve a language model's reasoning.
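The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration of the data flow, not the repository's code: `TwoStagePipeline`, `toy_rationale_model`, `toy_answer_model`, and the `Rationale:` prompt format are all hypothetical names, and the sketch assumes that the second stage receives the question with the generated rationale appended. The stand-in models are plain functions so that the example runs without any ML dependencies.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical types for illustration; not the repository's actual API.
VisionFeatures = Sequence[float]
Generator = Callable[[str, VisionFeatures], str]


@dataclass
class TwoStagePipeline:
    """Runs rationale generation, then answer inference, as separate models."""

    rationale_model: Generator  # stage 1: (text, vision features) -> rationale
    answer_model: Generator     # stage 2: (text + rationale, vision features) -> answer

    def run(self, question: str, vision: VisionFeatures) -> Tuple[str, str]:
        # Stage 1: generate a rationale from the question and the vision features.
        rationale = self.rationale_model(question, vision)
        # Stage 2 (assumed format): append the rationale to the original text input.
        stage2_input = f"{question}\nRationale: {rationale}"
        answer = self.answer_model(stage2_input, vision)
        return rationale, answer


def toy_rationale_model(text: str, vision: VisionFeatures) -> str:
    # Stand-in: a trained model would decode text conditioned on both inputs.
    return f"the image has {len(vision)} features"


def toy_answer_model(text: str, vision: VisionFeatures) -> str:
    # Stand-in: answers only if a rationale was appended to the input.
    return "(A)" if "Rationale:" in text else "(unknown)"


pipeline = TwoStagePipeline(toy_rationale_model, toy_answer_model)
rationale, answer = pipeline.run("Which object is magnetic?", [0.1, 0.2, 0.3])
print(rationale)
print(answer)
```

Keeping the two stages as separate callables mirrors the decoupled training described above: each stage could be trained and swapped independently, and the same vision features are passed to both.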