showlab/Show-o
A single transformer architecture that unifies multimodal understanding and generation by combining LLMs with diffusion models.

Show-o is a research repository for a unified multimodal model that handles both understanding and generation in one transformer. The architecture combines large language model capabilities with diffusion-based generation, covering visual understanding tasks (VQA, captioning) as well as image synthesis. Instead of relying on separate encoder-decoder pipelines for each kind of task, it handles both with a single unified model.
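
To make the "one transformer for both tasks" idea concrete, here is a minimal sketch of how text and image inputs could share a single token sequence and a single attention mask. It assumes images are already quantized into discrete codes, that image codes are offset into a shared vocabulary, and that text positions attend causally while image positions attend to each other in both directions. The vocabulary sizes and the names `build_sequence`, `is_image_token`, and `attention_mask` are hypothetical and chosen for illustration; this is not the repository's API.

```python
# Illustrative sketch only: names, vocabulary sizes, and layout are assumptions,
# not the Show-o repository's actual code.
from typing import List

TEXT_VOCAB_SIZE = 1000  # hypothetical size of the text vocabulary


def build_sequence(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Concatenate text tokens and image codes into one shared-vocabulary sequence."""
    # Offset image codes so they cannot collide with text token ids.
    return list(text_ids) + [TEXT_VOCAB_SIZE + code for code in image_codes]


def is_image_token(token_id: int) -> bool:
    """Image tokens occupy the id range above the text vocabulary."""
    return token_id >= TEXT_VOCAB_SIZE


def attention_mask(sequence: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j."""
    n = len(sequence)
    # Start from a causal mask, as in a standard language model.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption: image tokens additionally see each other in both directions.
            if is_image_token(sequence[i]) and is_image_token(sequence[j]):
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    seq = build_sequence(text_ids=[1, 2], image_codes=[0, 3])
    print(seq)
    for row in attention_mask(seq):
        print(["x" if allowed else "." for allowed in row])
```

In this sketch the same sequence layout serves a captioning-style input (image codes followed by text) or a generation-style input (text followed by image codes); only the order of the segments and which positions are predicted would change.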