QwenLM/Qwen-Image

Alibaba's 20B image model actually reads the text you give it

An open-source diffusion model that renders Chinese and English text accurately enough to generate posters, comics, and infographics from prompts.

8k stars · Python · Image · Video · Audio · +26 ★/day over the last 7 days · Trend: steady

What it does

Qwen-Image is a 20-billion-parameter MMDiT foundation model for text-to-image generation and image editing. It comes in several variants: the base model for generation, monthly-tuned versions (2512, 2.0), an editing pipeline (Edit-2511), and a layered composition model. The project provides Hugging Face diffusers pipelines, ComfyUI nodes, and online demos through Qwen Chat.
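
As a rough illustration of the diffusers route mentioned above, the sketch below loads a pipeline and renders one image. The model ID `Qwen/Qwen-Image`, the function name `generate_image`, and the default size, step, and seed values are assumptions for illustration, not settings taken from the README; check the repository for the exact identifiers and recommended parameters for each variant.

```python
def generate_image(prompt, width=1024, height=1024, steps=50, seed=0,
                   model_id="Qwen/Qwen-Image"):  # model_id is an assumption
    """Render one image with a diffusers pipeline.

    Heavy operation: downloads the model weights on first use.
    """
    # Imported lazily so that defining this helper does not need torch/diffusers.
    import torch
    from diffusers import DiffusionPipeline

    # Use bfloat16 on a CUDA GPU, otherwise fall back to float32 on CPU.
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    dtype = torch.bfloat16 if use_cuda else torch.float32

    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
    pipe = pipe.to(device)

    # A seeded generator makes the output reproducible for a given setup.
    generator = torch.Generator(device=device).manual_seed(seed)
    result = pipe(prompt=prompt, width=width, height=height,
                  num_inference_steps=steps, generator=generator)
    return result.images[0]


# Example call (commented out because it downloads and runs the full model):
# image = generate_image('A coffee shop poster with the headline "Grand Opening"')
# image.save("poster.png")
```

The imports live inside the function so the file can be imported on a machine without a GPU stack; only the call itself needs the full dependency chain.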

The interesting bit

Most diffusion models mangle text like a bad tattoo artist. Qwen-Image’s unusual strength is rendering complex text, especially Chinese, with correct typography and layout. The 2.0 version claims support for instructions of up to 1k tokens, enough to generate infographics (PPTs, posters, comics) directly, which effectively turns prompt engineering into page layout. The edit variant accepts multiple input images and follows written instructions to apply consistent modifications.

Key highlights

  • Native 2K resolution support in recent versions; aspect-ratio presets baked into the pipeline API
  • Day-0 inference acceleration from multiple projects: vLLM-Omni, LightX2V (claiming 42.55× speedup via distillation), SGLang-Diffusion, and LeMiCa
  • Broad hardware support beyond NVIDIA: Hygon, Metax, Ascend, and Cambricon accelerators
  • LoRA ecosystem emerging (e.g., MajicBeauty) with ModelScope hosting
  • Benchmark claims on T2I-CoreBench and AI Arena, though specific scores aren’t listed in the README
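
To make the aspect-ratio point concrete, here is a minimal sketch of how a preset table of that kind could be derived. It is not the project's actual preset list: the helper `size_for_aspect`, the 1024×1024 pixel budget, and the rounding to multiples of 16 are all assumptions chosen for illustration.

```python
import math


def size_for_aspect(ratio_w, ratio_h, target_pixels=1024 * 1024, multiple=16):
    """Return (width, height) close to target_pixels with the given aspect ratio.

    Both sides are snapped to the nearest multiple of `multiple`, a common
    constraint for latent diffusion models (assumed here, not confirmed).
    """
    if ratio_w <= 0 or ratio_h <= 0:
        raise ValueError("aspect ratio terms must be positive")
    width = math.sqrt(target_pixels * ratio_w / ratio_h)
    height = target_pixels / width

    def snap(value):
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


# Hypothetical preset table built from the helper above.
PRESETS = {
    name: size_for_aspect(w, h)
    for name, (w, h) in {"1:1": (1, 1), "16:9": (16, 9), "9:16": (9, 16)}.items()
}
print(PRESETS)
# {'1:1': (1024, 1024), '16:9': (1360, 768), '9:16': (768, 1360)}
```

The resulting width and height would then be passed to the pipeline call as its `width` and `height` arguments.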

Caveats

  • The README warns of “performance misalignments” in earlier Edit versions; identity preservation and instruction following depend on using the latest diffusers commit
  • Requires bleeding-edge dependencies: transformers>=4.51.3 and a git install of diffusers from source
  • The 20B parameter count and 2K resolution imply substantial VRAM requirements; CPU fallback exists but is likely impractical for serious use
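
The VRAM concern can be checked with back-of-the-envelope arithmetic. The sketch below estimates memory for the weights alone, assuming all 20 billion parameters are stored at a single precision; it ignores activations, the text encoder, and the VAE, so real usage will be higher. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 20e9  # 20B, the parameter count stated for Qwen-Image
PRECISIONS = {"fp32": 4, "bf16": 2, "int8": 1}  # bytes per parameter

for name, nbytes in PRECISIONS.items():
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
# fp32: 74.5 GiB
# bf16: 37.3 GiB
# int8: 18.6 GiB
```

Even at bf16 the weights alone exceed the memory of a typical 24 GB consumer card, which is why quantized or distilled variants matter for local use.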

Verdict

Worth a look if you generate images with embedded text, work with Chinese-language content, or need open-source alternatives to closed editing APIs. Skip if you’re GPU-poor or need a lightweight, stable dependency chain: this is a research-grade model with research-grade setup friction.
