FoundationVision/Liquid
Liquid is an autoregressive foundation model that unifies language and image understanding and generation within a single scalable architecture.

Liquid uses autoregressive language models as a unified backbone for both visual comprehension and text-to-image generation. It extends traditional LLMs to handle multi-modal inputs and outputs, so the same model can perform text-to-image synthesis as well as visual question answering. The repository provides pretraining and evaluation scripts, and the model weights are hosted on Hugging Face.
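
To make the "unified backbone" idea concrete, here is a minimal sketch of how text and image content could share one token sequence for next-token prediction. It is an illustration under stated assumptions, not Liquid's actual code: it assumes images are represented as discrete codebook indices, and the vocabulary sizes, the BOI/EOI marker ids, and all function names are hypothetical.

```python
from typing import List, Tuple

# All constants below are hypothetical values chosen for illustration.
TEXT_VOCAB_SIZE = 32000        # assumed size of the text vocabulary
IMAGE_CODEBOOK_SIZE = 8192     # assumed size of the discrete image codebook
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # assumed begin-of-image marker id
EOI = BOI + 1                                # assumed end-of-image marker id


def image_codes_to_token_ids(codes: List[int]) -> List[int]:
    """Shift image codebook indices past the text ids so both share one id space."""
    for code in codes:
        if not 0 <= code < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"image code out of range: {code}")
    return [TEXT_VOCAB_SIZE + code for code in codes]


def build_generation_sequence(prompt_ids: List[int], image_codes: List[int]) -> List[int]:
    """Text-to-image layout: prompt tokens first, then the image tokens to predict."""
    return list(prompt_ids) + [BOI] + image_codes_to_token_ids(image_codes) + [EOI]


def build_understanding_sequence(image_codes: List[int], question_ids: List[int]) -> List[int]:
    """Visual question answering layout: image tokens first, then the question tokens."""
    return [BOI] + image_codes_to_token_ids(image_codes) + [EOI] + list(question_ids)


def next_token_pairs(sequence: List[int]) -> List[Tuple[int, int]]:
    """Pair each token with its successor, the target of autoregressive training."""
    return list(zip(sequence[:-1], sequence[1:]))


if __name__ == "__main__":
    seq = build_generation_sequence(prompt_ids=[5, 7], image_codes=[0, 3])
    print(seq)
    print(next_token_pairs(seq))
```

Under these assumptions, both tasks reduce to the same objective over one id space; only the order of the text and image segments changes.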