deepseek-ai/Janus

One model, two faces: DeepSeek's Janus reads and draws images

A unified transformer that handles both multimodal understanding and text-to-image generation without splitting into separate specialist models.

Velocity (7d): +30 ★/day · Trend: steady

What it does

Janus is a family of vision-language models that can answer questions about images and also generate new images from text prompts. The latest iteration, Janus-Pro, scales up to 7B parameters with expanded training data and an optimized training strategy. JanusFlow takes a different approach, combining autoregressive language modeling with rectified flow for generation. Each variant uses a single transformer backbone rather than bolting together separate understanding and generation pipelines.
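For readers unfamiliar with rectified flow, here is a minimal sketch of the general technique, not JanusFlow's actual code. It assumes the common formulation: training pairs lie on a straight line between a noise sample (t=0) and a data sample (t=1), a network regresses the constant velocity along that line, and sampling integrates the learned velocity with Euler steps. The names `interpolate`, `target_velocity`, `euler_sample`, and `oracle_velocity` are hypothetical, and the "oracle" is a hand-written stand-in for a trained network.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Regression target for the velocity network: constant along the path."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(velocity_fn, noise, steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a trained network (assumption): with a single data point,
# the ideal velocity at (x, t) points straight at that data point.
DATA = [1.0, -2.0, 0.5]


def oracle_velocity(x, t):
    return [(d - xi) / (1.0 - t) for d, xi in zip(DATA, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in DATA]
sample = euler_sample(oracle_velocity, noise)
```

In a real model the oracle would be replaced by a network trained to match `target_velocity` at points produced by `interpolate`; the sampler stays the same.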

The interesting bit

The original Janus decouples visual encoding into separate pathways for understanding and generation, then feeds both into one unified transformer. This sidesteps the usual tension where a single vision encoder gets pulled in opposite directions: understanding needs high-level semantic features, while image synthesis needs low-level representations that preserve spatial structure and texture. JanusFlow pushes further by showing rectified flow can be trained inside a standard LLM framework without heavy architectural surgery.
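The decoupling idea can be illustrated with a toy sketch. Everything here is hypothetical: the two encoders are trivial stand-ins (a row average and a pixel quantiser), not the encoders Janus uses, and `build_sequence` only shows the routing, where the task picks the visual pathway and the result is one flat token sequence for a shared transformer.

```python
from dataclasses import dataclass


@dataclass
class Token:
    source: str   # which pathway produced this token
    value: float  # toy stand-in for an embedding vector


def understanding_encoder(pixels):
    """Toy semantic pathway: one summary token per image row."""
    return [Token("und", sum(row) / len(row)) for row in pixels]


def generation_encoder(pixels, levels=4):
    """Toy discrete pathway: quantise every pixel to a codebook index."""
    return [
        Token("gen", float(min(int(p * levels), levels - 1)))
        for row in pixels
        for p in row
    ]


def text_encoder(prompt):
    """Toy text pathway: one token per whitespace-separated word."""
    return [Token("text", float(len(word))) for word in prompt.split()]


def build_sequence(prompt, pixels, task):
    """Route the image through the pathway that matches the task."""
    if task == "understand":
        image_tokens = understanding_encoder(pixels)
    elif task == "generate":
        image_tokens = generation_encoder(pixels)
    else:
        raise ValueError(f"unknown task: {task}")
    # Both tasks end up as one flat sequence for the same transformer.
    # The text-then-image ordering is illustrative only.
    return text_encoder(prompt) + image_tokens


image = [[0.0, 0.5], [0.9, 1.0]]
und_seq = build_sequence("what is this", image, "understand")
gen_seq = build_sequence("what is this", image, "generate")
```

The point of the design is visible in the two sequences: the same backbone would see few, coarse tokens for understanding and many, fine-grained tokens for generation, without one encoder having to serve both needs.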

Key highlights

  • Four models available: Janus-1.3B, JanusFlow-1.3B, Janus-Pro-1B, and Janus-Pro-7B
  • Sequence length of 4096 tokens across all variants
  • MIT license for code; model weights under a separate Model Agreement that permits commercial use
  • Quick-start inference code provided for both multimodal understanding and text-to-image generation
  • Online demos hosted on Hugging Face for hands-on testing

Caveats

  • The README notes a bug fix in tokenizer configuration that previously broke classifier-free guidance and degraded visual generation quality; older checkpoints or third-party ports may still carry this issue
  • Inference examples require trust_remote_code=True and custom model classes, so this is not a drop-in transformers replacement
  • Janus-Pro’s technical report is linked as a local PDF rather than an arXiv preprint, making independent verification of benchmark claims less straightforward
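To make the first caveat concrete, here is a minimal sketch of the arithmetic behind classifier-free guidance in its standard form. It is not taken from the Janus code base; `apply_cfg` and `guidance_scale` are hypothetical names. The model is run twice per step, once with the prompt and once without, and the two sets of logits are combined. How the tokenizer bug interfered with this step is not described above, so the sketch covers only the combination itself.

```python
def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    """Push logits away from the unconditional prediction, toward the prompt.

    guidance_scale = 0 ignores the prompt, 1 reproduces the conditional
    logits, and values above 1 exaggerate the prompt's influence.
    """
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


cond = [2.0, 0.0]    # logits computed with the text prompt
uncond = [1.0, 1.0]  # logits computed with the prompt blanked out
guided = apply_cfg(cond, uncond, guidance_scale=3.0)
```

Because the result depends on the difference between the two branches, anything that corrupts the unconditional branch degrades every generated token, which is consistent with the quality drop the README describes.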

Verdict

Worth a look if you’re building multimodal applications and want one model that handles both directions of image-text interaction. Skip if you need battle-tested, widely integrated APIs: this is research-grade code with custom model definitions.
