OpenMOSS/MOVA

Open-source video generation finally has something to say

MOVA is an open-source foundation model that generates synchronized video and audio in a single pass, sparing open-source video from its silent-film era.

★ 1.1k stars · Python · Image, Video, Audio · Inference, Serving

What it does

MOVA synthesizes video and matched audio (speech, sound effects, room tone) in a single diffusion pass rather than generating a mute clip and tacking on sound afterward. Feed it a text prompt and a reference image, and it produces a short video in which lip movements, speaker voices, and ambient audio stay aligned. The repository ships inference code, training pipelines, LoRA fine-tuning scripts, and a full evaluation suite with 11 metrics across 7 groups.

The interesting bit

The architecture keeps pre-trained video and audio towers separate but fuses them with bidirectional cross-attention, so each modality nudges the other during generation instead of one dominating the other. It is a rare fully open release—weights, benchmarks, and Arena-style evaluation sets included—competing on lip-sync benchmarks against closed rivals.
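
To make the fusion idea concrete, here is a minimal sketch of bidirectional cross-attention between two token streams. It is an illustration only, not MOVA's actual code: the function names are hypothetical, there is a single attention head with no learned projections, and both streams are assumed to share one hidden size. The point it shows is that each stream is updated from the other stream's pre-fusion state, so neither modality leads.

```python
import math


def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    # Each query token takes a softmax-weighted average of the context tokens.
    # Hypothetical simplification: keys and values are the raw context tokens.
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        out.append([
            sum(w * k[d] for w, k in zip(weights, context))
            for d in range(len(q))
        ])
    return out


def bidirectional_fusion(video_tokens, audio_tokens):
    # Both updates read the same pre-fusion states, then are added residually.
    video_from_audio = cross_attend(video_tokens, audio_tokens)
    audio_from_video = cross_attend(audio_tokens, video_tokens)
    new_video = [[v + u for v, u in zip(tok, upd)]
                 for tok, upd in zip(video_tokens, video_from_audio)]
    new_audio = [[a + u for a, u in zip(tok, upd)]
                 for tok, upd in zip(audio_tokens, audio_from_video)]
    return new_video, new_audio


video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy video tokens
audio = [[0.5, -0.5]]                          # one toy audio token
fused_video, fused_audio = bidirectional_fusion(video, audio)
```

A real model would add learned query, key, and value projections, multiple heads, and normalization around each tower; the sketch keeps only the two-way information flow.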

Key highlights

Native bimodal generation: one inference pass for both video and audio, which the team claims eliminates the error accumulation of cascaded pipelines.
Hardware flexibility: runs on H100s, RTX 4090s, or Ascend NPUs, with offloading strategies that trade VRAM for host RAM and step time.
Evaluation rigor: releases 11 metrics across 7 groups (including audio quality, lip-sync, and AV alignment), plus a 732-sample benchmark for Arena-style subjective evaluation.
Ecosystem hooks: offers SGLang integration for high-throughput serving and a ComfyUI node for workflow tinkerers.
Two resolutions: MOVA-360p and MOVA-720p, both supporting text-and-image conditioning.
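
Arena-style evaluation usually turns pairwise human preferences into ratings. The sketch below shows a standard Elo update for one comparison between two models. Whether MOVA's arena uses exactly this rule is an assumption; the K-factor of 32 and the 1500 starting rating are conventional defaults, not values taken from the repository.

```python
def expected_score(rating_a, rating_b):
    # Probability that A is preferred over B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    # score_a is 1.0 if A's clip was preferred, 0.0 if B's was, 0.5 for a tie.
    # k=32 is an assumed, conventional K-factor.
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; raters prefer model A once.
model_a, model_b = elo_update(1500.0, 1500.0, score_a=1.0)
```

Repeating the update over many rated pairs produces the kind of ranking that Elo charts summarize.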

Caveats

The hardware appetite is real: even the lighter offload strategy needs 12 GB of VRAM and roughly 77 GB of host RAM for an 8-second 360p clip, and step times stretch past 40 seconds on consumer GPUs.
The README claims “state-of-the-art” lip-sync scores but gives detailed numbers only for the Verse-Bench subset; the human-evaluation Elo results appear as charts without the underlying figures, so direct comparisons are hard to verify from the text alone.
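
As a rough planning aid, the hardware figures in the first caveat can be turned into a quick feasibility check. This is a sketch, not a tool from the repository: the class and function names are hypothetical, the 12 GB and 77 GB values are the ones quoted above for an 8-second 360p clip, the 40-second step time is the quoted lower bound, and the step count of 50 is purely an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass
class OffloadBudget:
    vram_gb: float
    host_ram_gb: float


# Figures quoted above for the lighter offload strategy (8-second 360p clip).
LIGHT_OFFLOAD_360P = OffloadBudget(vram_gb=12.0, host_ram_gb=77.0)


def fits(budget, available_vram_gb, available_ram_gb):
    # True only if both the GPU memory and the host RAM meet the budget.
    return (available_vram_gb >= budget.vram_gb
            and available_ram_gb >= budget.host_ram_gb)


def estimated_minutes(num_steps, seconds_per_step=40.0):
    # Lower-bound wall time; the source says step times stretch past 40 s.
    return num_steps * seconds_per_step / 60.0


# A 24 GB consumer GPU with 64 GB of host RAM falls short on RAM, not VRAM.
can_run = fits(LIGHT_OFFLOAD_360P, available_vram_gb=24.0, available_ram_gb=64.0)
# Hypothetical 50-step run, for illustration only.
minutes = estimated_minutes(num_steps=50)
```

On machines like the one in the example, host RAM rather than VRAM is the limiting factor, which matches the caveat about RAM offloading.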

Verdict

Researchers and builders who need open, synchronized audiovisual generation—especially for multilingual lip-sync or multi-speaker scenes—should look here. Casual users without a workstation GPU or patience for RAM offloading should stick to the hosted API or ComfyUI wrappers.

Frequently asked

What is OpenMOSS/MOVA? MOVA is an open-source foundation model that generates synchronized video and audio in a single pass, sparing open-source video from its silent-film era.
Is MOVA open source? Yes. OpenMOSS/MOVA is open source, released under the Apache-2.0 license.
What language is MOVA written in? OpenMOSS/MOVA is primarily written in Python.
How popular is MOVA? OpenMOSS/MOVA has 1.1k stars on GitHub.
Where can I find MOVA? OpenMOSS/MOVA is on GitHub at https://github.com/OpenMOSS/MOVA.