A 14B video model that outruns 1.3B rivals by ignoring the playbook
Helios generates minute-scale videos at 19.5 FPS on one H100 by deliberately skipping every standard acceleration and anti-drift trick in the book.

What it does
Helios is a 14B-parameter diffusion model for text-to-video, image-to-video, video-to-video, and interactive generation. It synthesizes minute-long videos at 19.5 FPS end-to-end on a single H100 (about 10 FPS on Ascend NPU), and the authors claim it outperforms smaller 1.3B models in quality while doing so.
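To put the throughput figures in wall-clock terms, here is a minimal sketch of the arithmetic. It assumes a 24 FPS playback rate, which is a hypothetical value chosen for illustration; the summary above gives only the generation rates. It also assumes throughput stays constant over the whole clip.

```python
def generation_seconds(clip_seconds: float, playback_fps: float, gen_fps: float) -> float:
    """Wall-clock seconds needed to generate a clip at a constant generation rate."""
    total_frames = clip_seconds * playback_fps
    return total_frames / gen_fps


PLAYBACK_FPS = 24.0    # assumption for illustration, not stated in the source
H100_GEN_FPS = 19.5    # single H100, from the text
ASCEND_GEN_FPS = 10.0  # "about 10 FPS" on Ascend NPU, from the text

h100_seconds = generation_seconds(60.0, PLAYBACK_FPS, H100_GEN_FPS)
ascend_seconds = generation_seconds(60.0, PLAYBACK_FPS, ASCEND_GEN_FPS)

print(f"H100:   {h100_seconds:.1f} s for a 60 s clip")
print(f"Ascend: {ascend_seconds:.1f} s for a 60 s clip")
```

Under these assumptions a one-minute clip takes roughly 74 seconds on an H100 and 144 seconds on Ascend, so generation would run slightly slower than 24 FPS playback. A lower playback rate would change that conclusion.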
The interesting bit
The model reaches this speed without KV-cache, causal masking, sparse attention, TinyVAE, quantization, or conventional anti-drifting strategies such as keyframe sampling or error-banks. The authors frame this as a feature, not a bug: their optimizations raise throughput and cut memory enough to fit four 14B models in 80 GB of VRAM and to train at batch sizes comparable to image-diffusion models.
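A rough back-of-the-envelope check shows why the memory claim stands out. This sketch counts weights only and assumes 80 GB means 80e9 bytes and that unquantized weights take 2 bytes per parameter (16-bit). Both are assumptions, and the summary does not describe how the authors actually distribute memory.

```python
PARAMS_PER_MODEL = 14e9  # 14B parameters, from the text
NUM_MODELS = 4           # "four 14B models", from the text
VRAM_BYTES = 80e9        # assumption: 80 GB taken as 80e9 bytes
BYTES_PER_PARAM_16BIT = 2  # assumption: 16-bit weights, no quantization


def weights_gb(params: float, bytes_per_param: float, num_models: int) -> float:
    """Total weight storage in GB (1e9 bytes), ignoring activations and optimizer state."""
    return params * bytes_per_param * num_models / 1e9


naive_16bit_gb = weights_gb(PARAMS_PER_MODEL, BYTES_PER_PARAM_16BIT, NUM_MODELS)
budget_bytes_per_param = VRAM_BYTES / (PARAMS_PER_MODEL * NUM_MODELS)

print(f"16-bit weights for all four models: {naive_16bit_gb:.0f} GB")
print(f"Budget if everything is resident:   {budget_bytes_per_param:.2f} bytes/param")
```

Under these assumptions, keeping four sets of 16-bit weights resident would take about 112 GB, more than the 80 GB available. The claim therefore presumably depends on the memory optimizations the authors mention rather than on naive weight residency.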
Key highlights
- Three model variants: Helios-Base (best quality, v-prediction), Helios-Mid (intermediate checkpoint with CFG-Zero*), and Helios-Distilled (best efficiency, x0-prediction with custom DMD scheduler)
- Day-0 inference support across Diffusers, SGLang-Diffusion, vLLM-Omni, and Ascend NPU
- VRAM usage can drop to ~6 GB with Group Offloading; multi-GPU inference is supported via Ulysses, Ring, or Unified Attention context parallelism
- Community-tested up to 20.89 FPS on tuned H100 hardware
- Gradio demo and AOTI-compiled HuggingFace Spaces available
Caveats
- Image-to-Video and Video-to-Video are noted as “slightly inferior” to Text-to-Video because training was T2V-first; the README suggests workarounds like is_skip_first_chunk and noise-sigma tuning
- Helios-Mid is explicitly flagged as an intermediate distillation checkpoint that “may not meet expected quality”
- Real-time performance depends heavily on CPU, system memory, and CUDA driver version, not just GPU
Verdict
Worth a look if you’re building video generation pipelines and skeptical that bigger always means slower. Skip it if you need polished I2V/V2V out of the box without parameter tweaking.