NVIDIA-NeMo/Speech

NVIDIA's speech toolkit pivots hard to audio-only

NeMo shed its multimodal skin to focus on ASR, TTS, and speech LLMs—just as the field gets interesting.

Velocity (7d): +22 ★/day. Trend: accelerating.

What it does

NeMo Speech is NVIDIA’s PyTorch framework for building and deploying speech models: automatic speech recognition, text-to-speech synthesis, and speech-centric LLMs. It ships with pre-trained checkpoints and aims to get researchers from idea to trained model without rebuilding the plumbing each time.

The interesting bit

The repo recently amputated its broader multimodal limbs: vision, general LLM training, and other modalities were either moved out of the repository or left behind at v2.7.0. What’s left is a tighter audio-only toolkit with some genuinely slick recent releases. Parakeet-unified handles offline and streaming ASR in one checkpoint with a 160 ms minimum latency, and Nemotron VoiceChat promises full-duplex, interruptible conversation. Whether the pivot reads as disciplined focus or a retreat depends on your level of cynicism.

Key highlights

  • Streaming ASR with latency-accuracy tradeoffs selectable at inference time (Nemotron-Speech-Streaming)
  • Multilingual TTS covering nine languages in MagpieTTS v2602
  • Canary-Qwen-2.5B posts a record-setting 5.63% word error rate (WER) on the English Open ASR Leaderboard
  • Apache 2.0 licensed, with open-weight checkpoints hosted on Hugging Face
  • Requires Python 3.12+, PyTorch 2.6+, and an NVIDIA GPU for training
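
For context on the WER figure in the list above: word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that definition. The function name is made up for this example, and it skips the text normalization (casing, punctuation) that leaderboards typically apply before scoring, so it should not be read as the leaderboard's actual scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"):
# 2 errors over 6 reference words, i.e. about 0.33.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

By this definition, a 5.63% WER corresponds to roughly 5.6 word errors per 100 reference words.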

Caveats

  • The repository is mid-transformation; the first post-split stable release isn’t expected until June 2026, and the current stable version lives in an NGC container
  • Some older model checkpoints may need to be loaded with weights_only=False on PyTorch 2.6+ (where torch.load defaults to weights_only=True), which allows arbitrary code execution if the file is untrusted
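
The weights_only caveat comes down to Python's pickle format, which torch.load uses for checkpoints: a pickle file can name any importable callable and have it invoked during loading. The stdlib-only sketch below illustrates the risk and the general allow-list idea behind a restricted loader. It does not use PyTorch, and the `Payload` class, the `ALLOWED` set, and `safe_loads` are hypothetical names for this example, not PyTorch's actual implementation or allow-list.

```python
import io
import pickle
from collections import OrderedDict


class Payload:
    """Stand-in for a malicious object embedded in a checkpoint file."""

    def __reduce__(self):
        # Tells pickle to call eval("6 * 7") at load time. Harmless here,
        # but the callable could just as easily be os.system.
        return (eval, ("6 * 7",))


# Hypothetical allow-list for illustration only.
ALLOWED = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global by name.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def safe_loads(data: bytes):
    """Unpickle data while refusing any global outside ALLOWED."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


malicious = pickle.dumps(Payload())
benign = pickle.dumps(OrderedDict(weight=[0.1, 0.2]))

# Unrestricted loading runs the embedded call and returns its result.
print(pickle.loads(malicious))

# Restricted loading accepts plain containers and rejects the payload.
print(safe_loads(benign))
try:
    safe_loads(malicious)
except pickle.UnpicklingError as err:
    print("rejected:", err)
```

The practical takeaway is to reserve weights_only=False for checkpoints whose origin you trust.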

Verdict

Worth a look if you’re doing serious speech research or production ASR/TTS on NVIDIA hardware. Skip it if you need the broader multimodal toolkit—that’s in maintenance mode elsewhere.

Frequently asked

What is NVIDIA-NeMo/Speech?
NVIDIA-NeMo/Speech is NVIDIA’s PyTorch framework for building and deploying speech models: automatic speech recognition, text-to-speech synthesis, and speech-centric LLMs.
Is Speech open source?
Yes — NVIDIA-NeMo/Speech is open source, released under the Apache-2.0 license.
What language is Speech written in?
NVIDIA-NeMo/Speech is primarily written in Python.
How popular is Speech?
NVIDIA-NeMo/Speech has 17.6k stars on GitHub, and its star growth is currently accelerating.
Where can I find Speech?
NVIDIA-NeMo/Speech is on GitHub at https://github.com/NVIDIA-NeMo/Speech.
