← all repositories
vllm-project/vllm-omni

vLLM grows eyes, ears, and a paintbrush

The text-only inference engine now serves models that generate images, video, audio, and speech—without forgetting how to write.

vllm-omni
Velocity · 7d
+18
★ / day
Trend
steady
star history

What it does vLLM-Omni extends the popular vLLM inference engine to handle “omni-modality” models: anything-to-anything systems that ingest text, images, video, or audio and output any of those formats in return. It keeps vLLM’s KV-cache optimizations for autoregressive text generation and adds pipelined execution for diffusion transformers and other non-autoregressive architectures.

The interesting bit The architecture is fully disaggregated: an “OmniConnector” dynamically allocates resources across stages, so a model that needs to understand speech, plan a response, and render a video doesn’t bottleneck on a single GPU queue. The project also ships an OpenAI-compatible API server, which means you can (theoretically) drop it into existing tooling without rewriting clients.

Key highlights

  • Supports text, image, video, and audio in both directions—input and output
  • Adds DiT (Diffusion Transformer) and parallel generation models to vLLM’s autoregressive core
  • Distributed inference with tensor, pipeline, data, and expert parallelism
  • Hardware coverage: CUDA, ROCm, MUSA, NPU, and XPU backends
  • Rebased on upstream vLLM v0.16.0 as of the 0.16.0 release
  • Community skill packs available for Cursor, Claude, and Codex integration

Caveats

  • The README is light on concrete latency or throughput numbers; claims “fast” and “high throughput” but points to the paper for actual benchmarks
  • Model support list is currently dominated by Qwen-family models (Qwen-Omni, Qwen-Image, Qwen3-TTS) plus Bagel, MiMo-Audio, and GLM-Image—breadth is still growing
  • “Seamless integration” is the project’s phrasing; your mileage with non-HuggingFace model formats is unclear

Verdict Worth a look if you’re already running vLLM and need to serve multimodal models in production. Skip it if you’re looking for a standalone training framework or need deep customization of non-Qwen model architectures.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.