Phantom-video/HuMo

ByteDance & Tsinghua’s 17B-parameter human video puppeteer

It generates controllable human videos by orchestrating text, reference images, and audio into a single diffusion model.

★1.3k stars Python Image · Video · Audio

What it does

HuMo is a diffusion-based video generation framework focused entirely on human subjects. It accepts combinations of text prompts, reference images, and audio to synthesize footage where characters match specified appearances, scenes, and lip movements. The release includes both a 17B-parameter flagship model and a 1.7B “lightweight” variant designed to squeeze into 32 GB of VRAM.

The interesting bit

Rather than bolting modalities onto a text-to-video backbone, HuMo weaves image and audio guidance directly into the diffusion process with configurable strength scales (scale_a, scale_t). The team also open-sourced HuMoSet, a 670K-sample dataset of open-source human videos annotated with dense Qwen2.5-VL captions and strictly synchronized audio, which underpins the model’s subject preservation and talking-head fidelity.
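To make the idea of per-modality strength scales concrete, here is a minimal sketch of one common way two guidance scales are combined in classifier-free guidance. It is an assumption for illustration only: the summary above names scale_t and scale_a but does not state HuMo's exact formula, and the function name combine_guidance and its three prediction inputs are hypothetical.

```python
from typing import List


def combine_guidance(
    uncond: List[float],
    text_only: List[float],
    text_audio: List[float],
    scale_t: float,
    scale_a: float,
) -> List[float]:
    """Generic nested classifier-free guidance over two conditions.

    NOTE: illustrative assumption, not confirmed as HuMo's formulation.
    uncond     -- model prediction with no conditioning
    text_only  -- prediction conditioned on text
    text_audio -- prediction conditioned on text and audio
    scale_t    -- how strongly to push toward the text condition
    scale_a    -- how strongly to push toward the audio condition
    """
    if not (len(uncond) == len(text_only) == len(text_audio)):
        raise ValueError("all predictions must have the same length")
    return [
        u + scale_t * (t - u) + scale_a * (ta - t)
        for u, t, ta in zip(uncond, text_only, text_audio)
    ]


# Toy one-element "predictions"; the scale values are arbitrary examples.
guided = combine_guidance([0.0], [1.0], [2.0], scale_t=2.0, scale_a=3.0)
```

In this form, setting both scales to 1 reproduces the fully conditioned prediction, while raising scale_a alone amplifies only the difference that audio contributes on top of text.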

Key highlights

Dual checkpoints: 17B supports 480P and 720P; 1.7B renders 480P in roughly eight minutes on a 32 GB GPU.
Multimodal conditioning modes: text-audio and text-image-audio. The feature list also names text-image, though the repo's todo list leaves it unclear whether text-image-only inference has shipped.
Multi-GPU inference via FSDP plus sequence parallelism.
ComfyUI nodes available for both model sizes, plus an OpenBayes playground for cloud testing.
Training data (HuMoSet) is fully open-source, curated from public datasets with no proprietary company footage.

Caveats

The model is trained on 97-frame sequences at 25 FPS (about 3.9 seconds per clip); generating beyond that length degrades output, and a dedicated checkpoint for longer generation has not yet been released.
The 1.7B model keeps audio-visual sync nearly intact, but the authors warn its visual quality is distinctly below the 17B version.
The inference todo list strikes through text-image input, creating ambiguity about whether that mode is actually shipping despite being advertised.
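The frame-length caveat is easy to check before running inference. The sketch below works out how long a 97-frame clip lasts at 25 FPS and whether a given audio track would need more frames than the model was trained on. The two constants come from the caveat above; the helper names are hypothetical and are not part of the HuMo codebase.

```python
import math

TRAINED_FRAMES = 97  # training sequence length stated above
FPS = 25             # frame rate stated above


def clip_seconds(frames: int = TRAINED_FRAMES, fps: int = FPS) -> float:
    """Duration in seconds of a clip with the given frame count."""
    return frames / fps


def frames_for_audio(audio_seconds: float, fps: int = FPS) -> int:
    """Frames needed to cover an audio track of the given length."""
    # Round first so floating-point noise does not add a spurious frame.
    return math.ceil(round(audio_seconds * fps, 6))


def exceeds_trained_length(audio_seconds: float) -> bool:
    """True if covering the audio needs more frames than the trained window."""
    return frames_for_audio(audio_seconds) > TRAINED_FRAMES


max_seconds = clip_seconds()               # 97 / 25 seconds
too_long = exceeds_trained_length(10.0)    # a 10-second track needs 250 frames
```

Audio longer than roughly 3.9 seconds therefore falls outside the trained window, which is where the degradation described above would be expected to start.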

Verdict

Researchers and creators who need fine-grained, multimodal control over human video should take a close look. Casual users without high-end GPUs should wait or use the hosted playground—the local hardware bar is steep even for the small model.

Frequently asked

What is Phantom-video/HuMo? It generates controllable human videos by orchestrating text, reference images, and audio into a single diffusion model.
Is HuMo open source? Yes. Phantom-video/HuMo is open source, released under the Apache-2.0 license.
What language is HuMo written in? Phantom-video/HuMo is primarily written in Python.
How popular is HuMo? Phantom-video/HuMo has 1.3k stars on GitHub.
Where can I find HuMo? Phantom-video/HuMo is on GitHub at https://github.com/Phantom-video/HuMo.