
kyegomez/MultiModalMamba

Multi-modal deep learning model combining Vision Transformer and Mamba SSM architectures for concurrent text and image processing.

Velocity (7d): +0.5 ★/day · Trend: steady

MultiModalMamba combines a Vision Transformer (ViT) with Mamba state space model (SSM) layers in a single multi-modal model. It processes text sequences and images together, using transformer attention alongside Mamba layers for feature extraction and fusion. The project is built on Zeta, a minimalist PyTorch-based AI framework, and provides both a MultiModalMambaBlock component and a full trainable model for multi-modal tasks.
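
To make the data flow concrete, here is a minimal, dependency-free sketch of the general idea: an image is cut into ViT-style patch tokens, those tokens are joined with text embeddings into one sequence, and a state space recurrence is scanned over the result. This is not the repository's API. The names `patchify`, `ssm_scan`, and `fuse_text_and_image` are hypothetical, the recurrence uses fixed scalar parameters (real Mamba layers use learned, input-dependent ones), and the sketch assumes text embeddings and flattened patches already share the same width, so no learned projection is shown.

```python
from typing import List

Vector = List[float]


def patchify(image: List[List[float]], patch: int) -> List[Vector]:
    """Split an H x W grayscale image into flattened patch x patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[row][col]
                for row in range(top, top + patch)
                for col in range(left, left + patch)
            ])
    return tokens


def ssm_scan(tokens: List[Vector], decay: float = 0.5, gain: float = 1.0) -> List[Vector]:
    """Scan a diagonal linear recurrence h_t = decay * h_{t-1} + gain * x_t.

    Assumption: fixed scalar `decay` and `gain` stand in for the learned,
    input-dependent parameters of a real Mamba layer.
    """
    state = [0.0] * len(tokens[0])
    outputs = []
    for token in tokens:
        state = [decay * s + gain * x for s, x in zip(state, token)]
        outputs.append(list(state))
    return outputs


def fuse_text_and_image(text_tokens: List[Vector], image: List[List[float]], patch: int) -> List[Vector]:
    """Concatenate text tokens with image patch tokens, then scan the joint sequence."""
    image_tokens = patchify(image, patch)
    if len(text_tokens[0]) != len(image_tokens[0]):
        raise ValueError("text and image tokens must have the same width")
    return ssm_scan(text_tokens + image_tokens)


if __name__ == "__main__":
    image = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]  # 4 x 4 toy image
    text = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]                   # 2 toy text embeddings
    fused = fuse_text_and_image(text, image, patch=2)
    print(len(fused), len(fused[0]))  # sequence length and feature width
```

Because the recurrence carries state across the whole sequence, every image token's output depends on the text tokens that precede it, which is the simplest form of the cross-modal mixing described above. A real implementation would add learned projections, attention, and trainable SSM parameters.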
