lucidrains/transfusion-pytorch
A PyTorch library implementing Meta AI's Transfusion, a multi-modal model that jointly performs next-token language prediction and image generation using flow matching.

The Transfusion architecture unifies autoregressive language modeling with continuous diffusion-style generation (flow matching in this implementation) in a single transformer. The model handles mixed sequences of discrete text tokens and continuous modality representations (such as images encoded as latents), which enables training on interleaved text-image data. The implementation supports classifier-free guidance for improved generation quality and can be extended to an arbitrary number of modalities.
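To make the idea of a mixed sequence concrete, here is a minimal sketch in plain Python. It is not this library's API: `Latents`, `layout`, and `combined_loss` are hypothetical names introduced only for illustration, and the weighted sum of the two losses is an assumption about how the text and modality objectives could be combined. Text positions would be trained with next-token prediction, while modality positions would be trained with the flow matching objective.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Latents:
    """Hypothetical stand-in for a block of continuous latents (e.g. an encoded image)."""
    length: int  # number of sequence positions the block occupies


# A sample interleaves lists of text token ids with latent blocks.
Segment = Union[List[int], Latents]


def layout(sample: List[Segment]) -> List[str]:
    """Label every sequence position as 'text' or 'modality'."""
    labels: List[str] = []
    for segment in sample:
        if isinstance(segment, Latents):
            labels.extend(["modality"] * segment.length)
        else:
            labels.extend(["text"] * len(segment))
    return labels


def combined_loss(text_loss: float, flow_loss: float, flow_weight: float = 1.0) -> float:
    """Assumed weighted sum of the language-modeling loss and the flow matching loss."""
    return text_loss + flow_weight * flow_loss


# Three text tokens, a four-position latent block, then two more text tokens.
sample: List[Segment] = [[1, 2, 3], Latents(length=4), [5, 6]]
labels = layout(sample)
print(labels)
```

The per-position labels show which objective would apply where; a real implementation would turn them into tensor masks rather than lists of strings.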