Robbyant/lingbot-world

An open-source world model that runs fast enough to play

LingBot-World turns a single image and a text prompt into an interactive, minute-long simulated world at 16 FPS with under one second of latency.

What it does

LingBot-World is an image-to-video world simulator built on top of Wan2.2. Feed it a still image, a text description, and optionally camera poses or action strings, and it generates extended video sequences while maintaining visual consistency across hundreds of frames. The project ships three model variants: a base camera-pose version, an action-controlled version with simplified keyboard-style commands, and a fast variant that uses chunked causal inference with KV caching for real-time interaction.

The interesting bit

The “Fast” model is the unusual part. Instead of generating all frames at once, it processes video chunk by chunk and reuses cached keys and values (KV caching) from earlier chunks, which is what lets it hit sub-second latency at 16 FPS. That makes it usable for interactive applications rather than only batch rendering. The project also explicitly targets the gap between open-source and closed-source world models, which is a nice change from the usual researchware that stops at the paper.
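
To make the chunk-by-chunk idea concrete, here is a toy sketch of attention with a KV cache. It is not LingBot-World's code: the `ChunkedGenerator` class, the identity projections, and the choice to let tokens see their whole chunk plus everything earlier are all assumptions made for illustration. The point is only that each new chunk attends over cached keys and values instead of recomputing the full sequence.

```python
import math


def attend(queries, keys, values):
    """Plain softmax attention: every query attends over all given keys."""
    outputs = []
    for q in queries:
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))
        ])
    return outputs


class ChunkedGenerator:
    """Hypothetical toy model: processes one chunk at a time with a KV cache."""

    def __init__(self):
        self.k_cache = []  # keys from all chunks seen so far
        self.v_cache = []  # values from all chunks seen so far

    def step(self, chunk):
        # Assumption: identity projections, so each token is its own q, k and v.
        # Only the new chunk's keys/values are computed and appended.
        self.k_cache.extend(chunk)
        self.v_cache.extend(chunk)
        # New tokens attend over the cache (earlier chunks plus this one).
        return attend(chunk, self.k_cache, self.v_cache)


gen = ChunkedGenerator()
chunks = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
outputs = [gen.step(chunk) for chunk in chunks]
print(len(gen.k_cache))  # 4 cached tokens after two chunks of two
```

Per step, the work grows with the cache size rather than requiring the whole video to be regenerated, which is the property an interactive loop needs.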

Key highlights

  • Three model tiers: Base (Cam), Base (Act) with keyboard-style action strings, and Fast for real-time use
  • Supports 480P and 720P output; up to ~961 frames (about a minute at 16 FPS) given sufficient GPU memory
  • Control via camera poses (OpenCV format), action strings like w-10,a-10,d-10, or no control signals at all
  • Community-provided 4-bit quantized model available for inference on limited VRAM
  • Apache 2.0 licensed with weights on HuggingFace and ModelScope
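
For a feel of the numbers in the highlights above, here is a small sketch that splits an action string such as w-10,a-10,d-10 into steps and checks the "about a minute" figure. The helper names `parse_actions` and `clip_seconds` are made up for this example, and the summary does not say what the number after each key means, so the parser just keeps it as an integer without interpreting it.

```python
def parse_actions(spec):
    """Split 'w-10,a-10,d-10' into [('w', 10), ('a', 10), ('d', 10)].

    Assumption: each step is '<key>-<non-negative integer>'. What the integer
    stands for (frames, repeats, ...) is not stated in the summary.
    """
    steps = []
    for item in spec.split(","):
        key, _, count = item.strip().partition("-")
        if not key or not count.isdigit():
            raise ValueError(f"malformed action step: {item!r}")
        steps.append((key, int(count)))
    return steps


def clip_seconds(frames, fps=16):
    """Duration in seconds of a clip with the given frame count."""
    return frames / fps


print(parse_actions("w-10,a-10,d-10"))  # [('w', 10), ('a', 10), ('d', 10)]
print(clip_seconds(961))                # 60.0625
```

So 961 frames at 16 FPS is just over 60 seconds, which matches the minute-long claim.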

Caveats

  • Requires a multi-GPU setup for the reference configurations (8 GPUs in the examples); single-GPU users will need the quantized model or significant patience
  • Built on Wan2.2, so you inherit whatever installation friction that brings: flash-attention compilation, torch >= 2.4.0, etc.
  • The 4-bit quantized model explicitly warns of “minor degradation in visual fidelity and temporal consistency”
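
Before fighting with flash-attention, it can save time to confirm that the torch >= 2.4.0 requirement is met. The following is a minimal sketch using only the standard library; the function names are invented here, and it deliberately ignores pre-release ordering (a 2.4.0a0 build is treated as 2.4.0).

```python
import re
from importlib import metadata


def version_tuple(text):
    """Turn '2.4.0+cu121' or '2.5.0a0' into leading integers, e.g. (2, 4, 0)."""
    public = text.split("+", 1)[0]  # drop a local build tag such as '+cu121'
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:  # pad '2.4' to (2, 4, 0) so comparisons line up
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="2.4.0"):
    """True if the installed version string is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_torch(minimum="2.4.0"):
    """Report whether the installed torch distribution satisfies the minimum."""
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return f"torch is not installed (need >= {minimum})"
    if meets_minimum(installed, minimum):
        return f"torch {installed}: ok"
    return f"torch {installed}: too old, need >= {minimum}"


print(check_torch())
```

This only checks the version number; it says nothing about whether the CUDA build matches your driver or whether flash-attention will compile.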

Verdict

Worth a look if you’re building interactive world simulators, game environments, or robot learning visualizers and need something with genuinely open weights. Skip it if you’re hoping for a lightweight single-GPU toy; this is still very much a workstation-or-cloud project.