amazon-science/mm-cot

A multimodal language model framework that uses chain-of-thought reasoning by integrating vision features with text for science question answering.

Velocity (7d): +3.3 ★ / day
Trend: steady

Multimodal-CoT implements a two-stage training framework for language models: the first stage generates a rationale, and the second infers the answer. Vision features extracted from ViT, CLIP, ResNet, and DETR encoders are incorporated into this decoupled training architecture. The framework is evaluated on the ScienceQA dataset, where it demonstrates that visual information can improve a language model's reasoning.
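
The two-stage flow can be illustrated with a minimal sketch. Everything named here (`TwoStagePipeline`, `format_prompt`, the stub models, the prompt layout) is hypothetical and chosen for illustration; it is not the repository's API, and the stubs stand in for trained models that would fuse the vision features with the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical types: a "model" is any callable mapping
# (input text, vision features) -> generated text.
VisionFeatures = Sequence[float]
Seq2Seq = Callable[[str, VisionFeatures], str]


def format_prompt(question: str, context: str, options: List[str]) -> str:
    """Build a text prompt (the layout is an assumption, not the repo's format)."""
    letters = "ABCDEFGH"
    opts = " ".join(f"({letters[i]}) {o}" for i, o in enumerate(options))
    return f"Question: {question}\nContext: {context}\nOptions: {opts}"


@dataclass
class TwoStagePipeline:
    rationale_model: Seq2Seq  # stage 1: generates a rationale
    answer_model: Seq2Seq     # stage 2: infers the answer

    def run(self, question: str, context: str, options: List[str],
            vision: VisionFeatures) -> Dict[str, str]:
        prompt = format_prompt(question, context, options)
        # Stage 1: rationale from the text prompt plus vision features.
        rationale = self.rationale_model(prompt, vision)
        # Stage 2: answer from the prompt with the rationale appended,
        # using the same vision features.
        answer = self.answer_model(f"{prompt}\nRationale: {rationale}", vision)
        return {"rationale": rationale, "answer": answer}


# Stub models so the sketch runs without any trained weights.
def stub_rationale_model(text: str, vision: VisionFeatures) -> str:
    return "Opposite poles face each other, so the magnets attract."


def stub_answer_model(text: str, vision: VisionFeatures) -> str:
    # Only answers when stage 1 output is present in the input.
    return "(A)" if "Rationale:" in text else "(unknown)"


if __name__ == "__main__":
    pipeline = TwoStagePipeline(stub_rationale_model, stub_answer_model)
    result = pipeline.run(
        question="Will these magnets attract or repel?",
        context="Two magnets are shown.",
        options=["attract", "repel"],
        vision=[0.1, 0.2, 0.3],  # placeholder for encoder output
    )
    print(result)
```

Keeping the two stages as separate callables mirrors the decoupled design: each stage can be trained and swapped independently, and the second stage only ever sees the rationale as additional input text.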
