lyuchenyang/Macaw-LLM
Multi-modal LLM combining vision, audio, and text processing for unified language modeling.

Velocity · 7d: +1.4 ★ / day
Trend: → steady
Macaw-LLM is a research project that extends language modeling to multiple modalities by integrating images, video, audio, and text into a single system. The architecture builds on pre-trained components: CLIP for visual understanding, Whisper for audio processing, and LLaMA as the base language model. The model can therefore process and reason over inputs from several modalities within one language modeling framework.
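To make the idea of a unified system more concrete, here is a minimal sketch of one common way to combine modalities: project each encoder's features to the language model's embedding width and concatenate them with the text tokens into one sequence. This is an illustration under stated assumptions, not Macaw-LLM's actual alignment code. The feature sizes, the random weights standing in for learned layers, and the helper names `make_projection`, `project`, and `fuse` are all hypothetical, and the random vectors only stand in for real CLIP, Whisper, and LLaMA outputs.

```python
import random

# Hypothetical feature sizes, chosen only for illustration; they are not
# taken from the Macaw-LLM code base.
IMAGE_DIM, AUDIO_DIM, TEXT_DIM = 6, 4, 8


def make_projection(in_dim, out_dim, rng):
    """Return a random in_dim x out_dim matrix (stand-in for a learned layer)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, weights):
    """Map each feature vector into the target space with a matrix product."""
    out_dim = len(weights[0])
    return [
        [sum(vec[i] * weights[i][j] for i in range(len(vec))) for j in range(out_dim)]
        for vec in features
    ]


def fuse(image_feats, audio_feats, text_embeds, w_image, w_audio):
    """Project non-text features to the text width and prepend them to the text tokens."""
    return project(image_feats, w_image) + project(audio_feats, w_audio) + text_embeds


rng = random.Random(0)
# Stand-ins for encoder outputs: 3 image patches, 2 audio frames, 5 text tokens.
image_feats = [[rng.random() for _ in range(IMAGE_DIM)] for _ in range(3)]
audio_feats = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(2)]
text_embeds = [[rng.random() for _ in range(TEXT_DIM)] for _ in range(5)]

w_image = make_projection(IMAGE_DIM, TEXT_DIM, rng)
w_audio = make_projection(AUDIO_DIM, TEXT_DIM, rng)

sequence = fuse(image_feats, audio_feats, text_embeds, w_image, w_audio)
print(len(sequence), len(sequence[0]))  # prints "10 8"
```

The resulting sequence has one row per image patch, audio frame, and text token, all with the same width, so a language model could in principle consume it as a single input. A real system would train the projection weights rather than sample them at random.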