Is Motus open source?

Yes — thu-ml/Motus is open source, released under the Apache-2.0 license.

What language is Motus written in?

thu-ml/Motus is primarily written in Python.

How popular is Motus?

thu-ml/Motus has 1.2k stars on GitHub.

Where can I find Motus?

thu-ml/Motus is on GitHub at https://github.com/thu-ml/Motus.

thu-ml/Motus

One 8B model for video, language, and robot control

Motus attempts to replace separate world models, VLAs, and video generators with one 8B-parameter stack.

What it does

Motus is a roughly 8-billion-parameter model that combines a video generator (Wan2.2-5B), a vision-language model (Qwen3-VL-2B), and two smaller transformer experts for action and understanding. It can operate in several modes (world model, vision-language-action model, inverse dynamics model, video generator, or joint video-action predictor) depending on what you ask of it. The project ships with pretrained checkpoints and training code targeting both simulation (RoboTwin 2.0) and real-world robot arms.
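To make the mode list concrete, here is a minimal sketch of how one set of weights can expose all five modes if video and actions each get their own diffusion timestep, which is the general UniDiffuser idea. Everything in it is an assumption for illustration: `T_MAX`, the role names, the linear schedule, and the mapping from mode to roles are a reading of the mode names, not code or configuration from the Motus repository. Language and current-frame conditioning are left out.

```python
T_MAX = 1000  # hypothetical number of diffusion steps

GIVEN = "given"      # clean input: timestep held at 0
IGNORED = "ignored"  # pure noise: timestep held at T_MAX (marginalized out)
SAMPLED = "sampled"  # denoised: timestep decreases from T_MAX toward 0

# (video role, action role) per mode; an illustrative mapping, not Motus's own
MODES = {
    "world_model": (SAMPLED, GIVEN),         # predict video from actions
    "inverse_dynamics": (GIVEN, SAMPLED),    # infer actions from video
    "vla": (IGNORED, SAMPLED),               # predict actions, no video output
    "video_generation": (SAMPLED, IGNORED),  # generate video, no actions
    "joint": (SAMPLED, SAMPLED),             # predict video and actions together
}


def timestep(role, step, num_steps):
    """Timestep fed to the model for one modality at sampling step `step`."""
    if role == GIVEN:
        return 0
    if role == IGNORED:
        return T_MAX
    # Linear schedule: T_MAX at step 0, shrinking by T_MAX / num_steps per step.
    return round(T_MAX * (1 - step / num_steps))


def schedule(mode, num_steps=4):
    """List of (video timestep, action timestep) pairs for a sampling run."""
    video_role, action_role = MODES[mode]
    return [
        (timestep(video_role, s, num_steps), timestep(action_role, s, num_steps))
        for s in range(num_steps)
    ]


print(schedule("world_model"))  # [(1000, 0), (750, 0), (500, 0), (250, 0)]
```

The point of the sketch is that switching modes changes only the timestep pair handed to the network, never the weights.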

The interesting bit

Instead of training actions from scratch, Motus uses optical flow to derive “latent actions,” treating pixel-level delta motion as a bridge between video prediction and robot control. A Mixture-of-Transformers architecture routes tasks to three experts, while a UniDiffuser-style scheduler lets the same weights behave like entirely different model types. The training recipe is a three-stage, six-layer data pyramid that mixes web video, synthetic data, and robot trajectories.
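As a toy illustration of "pixel-level motion as an action-like signal", the sketch below averages a dense optical-flow field over a coarse grid and flattens the result into a short vector. This is not Motus's latent-action encoder: the function name `pool_flow`, the grid pooling, and the input layout are all hypothetical, and in practice the flow would come from an optical-flow estimator and the compression would be learned.

```python
def pool_flow(flow, grid=2):
    """Average a dense flow field over a grid x grid layout of cells.

    flow: list of H rows, each a list of W (dx, dy) pixel displacements.
    Returns a flat list of 2 * grid * grid floats (mean dx, mean dy per cell).
    Assumes H and W are divisible by grid.
    """
    height, width = len(flow), len(flow[0])
    cell_h, cell_w = height // grid, width // grid
    vector = []
    for gy in range(grid):
        for gx in range(grid):
            cell = [
                flow[y][x]
                for y in range(gy * cell_h, (gy + 1) * cell_h)
                for x in range(gx * cell_w, (gx + 1) * cell_w)
            ]
            vector.append(sum(dx for dx, _ in cell) / len(cell))
            vector.append(sum(dy for _, dy in cell) / len(cell))
    return vector


# 4x4 frame: the left half moves 1 px to the right, the right half is static.
flow = [[(1.0, 0.0)] * 2 + [(0.0, 0.0)] * 2 for _ in range(4)]
print(pool_flow(flow))  # [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

A vector like this can be computed from any video, with or without robot labels, which is what makes flow attractive as a bridge between web video and robot trajectories.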

Key highlights

~8B total parameters: 5B video backbone, 2.13B VLM, 641M action expert, 253M understanding expert
Claims 87.02% success rate on RoboTwin 2.0 simulation, outperforming X-VLA and π₀.₅ baselines
Supports LeRobotDataset format and real-world embodiments including AC-One and Aloha-Agilex-2
Flexible mode switching: world model, VLA, inverse dynamics, video generation, or joint prediction
Requires serious hardware: >24 GB VRAM for inference, >80 GB for training
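The parameter breakdown and the VRAM figure can be sanity-checked with a little arithmetic. The component sizes below are the ones listed above; the 2 bytes per parameter (bf16 or fp16 weights) is an assumption, since the precision used for inference is not stated here.

```python
PARAMS = {
    "video backbone (Wan2.2-5B)": 5.0e9,
    "VLM (Qwen3-VL-2B)": 2.13e9,
    "action expert": 641e6,
    "understanding expert": 253e6,
}


def total_params(params=PARAMS):
    """Sum of all component parameter counts."""
    return sum(params.values())


def weight_gib(num_params, bytes_per_param=2):
    """Memory for the weights alone, in GiB (2 bytes/param assumes bf16 or fp16)."""
    return num_params * bytes_per_param / 2**30


total = total_params()
print(f"{total / 1e9:.3f}B parameters")       # 8.024B parameters
print(f"{weight_gib(total):.1f} GiB of weights")  # 14.9 GiB of weights
```

The components add up to about 8.02B, which matches the "~8B" headline. At 2 bytes per parameter the weights alone take roughly 15 GiB, so a requirement above 24 GB is plausible once activations and video latents are added on top.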

Caveats

Benchmark claims are limited to RoboTwin 2.0 simulation; no real-world success rates are reported in the README
Hardware requirements are steep: an RTX 5090 is the minimum for inference, and training demands A100 80GB-class GPUs or better
The README explicitly welcomes community help to maintain and extend the project

Verdict

Worth a look if you’re researching unified robot foundation models and have the GPU budget to match. Skip it if you need proven real-world benchmarks or a lightweight deployment.
