kyegomez/MultiModalMamba
Multi-modal deep learning model combining Vision Transformer and Mamba SSM architectures for concurrent text and image processing.

MultiModalMamba fuses a Vision Transformer (ViT) with Mamba state space models (SSMs) in a single multi-modal model. The architecture takes a text sequence and an image together, using transformer attention alongside Mamba layers, whose cost scales linearly with sequence length, for feature extraction and fusion. It is built on Zeta, a minimalist PyTorch-based AI framework, and provides two entry points: a MultiModalMambaBlock component and a full trainable model for multi-modal tasks.
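
To make the data flow concrete, here is a minimal PyTorch sketch of the general pattern described above: image patches pass through transformer attention, are concatenated with text tokens, and the fused sequence goes through a sequence mixer. This is not the repository's implementation. `ToyFusionBlock` and all of its parameters are hypothetical names chosen for illustration, concatenation is only one possible fusion method, and `nn.GRU` stands in for a Mamba layer so the example runs with plain PyTorch.

```python
import torch
from torch import nn


class ToyFusionBlock(nn.Module):
    """Hypothetical sketch, not the repository's MultiModalMambaBlock."""

    def __init__(self, dim=64, patch_size=8, image_size=32, heads=4):
        super().__init__()
        # Patch embedding: a strided conv turns each image patch into one token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.num_patches = (image_size // patch_size) ** 2
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        # Transformer attention over image patches (the ViT-style part).
        self.vit_layer = nn.TransformerEncoderLayer(
            dim, heads, dim_feedforward=2 * dim, batch_first=True
        )
        # ASSUMPTION: nn.GRU is a stand-in for a Mamba SSM layer. Any module
        # mapping (batch, length, dim) -> (batch, length, dim) could go here.
        self.mixer = nn.GRU(dim, dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, image):
        # text: (B, L, D) already-embedded tokens; image: (B, 3, H, W)
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, D)
        patches = self.vit_layer(patches + self.pos)
        # Fuse by concatenating along the sequence axis: (B, N + L, D)
        fused = torch.cat([patches, text], dim=1)
        mixed, _ = self.mixer(fused)
        # Residual connection around the mixer, then layer norm.
        return self.norm(fused + mixed)


block = ToyFusionBlock()
text = torch.randn(2, 10, 64)       # batch of 2, 10 text tokens, dim 64
image = torch.randn(2, 3, 32, 32)   # batch of 2 RGB images, 32x32
out = block(text, image)
print(out.shape)  # torch.Size([2, 26, 64]): 16 patches + 10 text tokens
```

In a real setup, the stand-in mixer would be replaced by an actual Mamba layer, and the text input would come from a token embedding rather than random tensors.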