FoundationVision/UniTok
A unified visual tokenizer that converts images into discrete tokens for use in autoregressive generation and multimodal understanding models.

UniTok is a unified visual tokenizer designed for both visual generation and visual understanding. It encodes images as discrete tokens that are compatible with autoregressive generative models such as LlamaGen and with multimodal understanding models such as LLaVA. It also supports unified multimodal LLMs, including Chameleon and Liquid, so that image generation and image comprehension can run within the same framework. The paper was published at NeurIPS 2025 as a Spotlight.
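
To make "discrete tokenization" concrete, the following is a minimal sketch of generic nearest-neighbor vector quantization, the basic idea behind many discrete visual tokenizers. It is not UniTok's actual implementation or API: the function names (`quantize`, `dequantize`), the toy codebook, and the toy feature vectors are all assumptions chosen for illustration. A real tokenizer would use a learned encoder and a much larger codebook.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    Hypothetical helper: the returned indices play the role of discrete
    image tokens that an autoregressive model could consume.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def dequantize(token_ids: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each token id (input to a decoder)."""
    return [list(codebook[i]) for i in token_ids]


# Toy codebook: 4 entries of dimension 2 (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Toy "patch features" standing in for the output of an image encoder.
features = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

token_ids = quantize(features, codebook)
reconstructed = dequantize(token_ids, codebook)

print(token_ids)      # [0, 3, 2]
print(reconstructed)  # [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

In this sketch the token ids serve two roles: a generative model can predict them one at a time, and an understanding model can embed them as input, which is the sense in which a single tokenizer can serve both kinds of task.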