A speech quality rater that tells you *why* your call sounds awful
NISQA rates transmitted speech on an overall MOS plus four diagnostic dimensions, not just one blunt number, and has a separate model for synthetic speech.

What it does
NISQA is a deep-learning speech quality predictor that works without a clean reference track. Feed it a degraded audio file and it returns an overall quality score plus four specific culprits: Noisiness, Coloration, Discontinuity, and Loudness. A separate model variant, NISQA-TTS, rates how natural synthetic speech from TTS or voice conversion systems sounds.
The interesting bit
Most quality metrics give you a single number and shrug. NISQA v2.0 breaks down why quality suffered, which is genuinely useful for debugging codecs, network glitches, or pipeline choices. It also ships as a configurable training framework: pick CNN or feed-forward layers for the framewise stage, self-attention or an LSTM for time dependencies, and go double-ended if you have reference audio, all controlled through YAML files.
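To make the "why" concrete, here is a minimal sketch of how you might triage the per-dimension scores once you have them. It does not call NISQA itself. The dictionary keys, the example values, and the 3.0 threshold are assumptions for illustration, and it assumes every dimension sits on a 1 to 5 scale where higher is better.

```python
# Illustrative only: key names and values are assumptions, not NISQA's output schema.
DIMENSIONS = ("noisiness", "coloration", "discontinuity", "loudness")


def worst_dimension(scores):
    """Return (name, score) for the lowest-scoring quality dimension."""
    name = min(DIMENSIONS, key=lambda d: scores[d])
    return name, scores[name]


def summarize(scores, threshold=3.0):
    """List dimensions scoring below `threshold`, worst first."""
    flagged = [(d, scores[d]) for d in DIMENSIONS if scores[d] < threshold]
    return sorted(flagged, key=lambda item: item[1])


# Made-up scores for one degraded call recording.
example = {
    "overall": 2.4,
    "noisiness": 4.1,
    "coloration": 3.6,
    "discontinuity": 1.9,
    "loudness": 3.8,
}

print(worst_dimension(example))  # ('discontinuity', 1.9)
print(summarize(example))        # [('discontinuity', 1.9)]
```

With scores like these, a low overall number points at dropouts or packet loss rather than background noise, which is the kind of lead a single MOS cannot give you.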
Key highlights
- Pre-trained weights for transmitted speech (NISQA v2.0) and synthesized speech (NISQA-TTS v1.0)
- Single-ended and double-ended prediction modes
- Modular architecture: CNN/DFF → Self-Attention/LSTM → various pooling strategies
- Includes a corpus of 14,000+ labeled speech samples with real-world degradation (Zoom, Skype, mobile, packet loss)
- Fine-tuning and transfer learning supported via CSV + YAML workflow
Caveats
- Model weights are licensed CC BY-NC-SA 4.0, so non-commercial use only
- The Wiki is referenced repeatedly but marked “not yet added” in the README
- Stereo files require manual channel selection; no automatic downmixing mentioned
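On the stereo caveat: if you would rather hand the tool a mono file than pick a channel each time, a small pre-processing step covers it. The sketch below uses only the Python standard library and is not part of NISQA. The function name `extract_channel` is hypothetical, and it assumes 16-bit PCM WAV input.

```python
import array
import wave


def extract_channel(src_path, dst_path, channel=0):
    """Copy one channel of a 16-bit PCM WAV file into a new mono WAV file.

    Returns the number of samples written.
    """
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        if not 0 <= channel < n_channels:
            raise ValueError(
                f"channel {channel} not available in a {n_channels}-channel file"
            )
        frame_rate = src.getframerate()
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))

    # Samples are interleaved (ch0, ch1, ch0, ch1, ...),
    # so a strided slice picks out a single channel.
    mono = samples[channel::n_channels]

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(frame_rate)
        dst.writeframes(mono.tobytes())
    return len(mono)
```

Calling `extract_channel("call_stereo.wav", "call_left.wav", channel=0)` (file names made up) would write the first channel to a mono file. This picks a channel rather than averaging the two, which mirrors the manual selection described above.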
Verdict
Worth a look if you build VoIP pipelines, evaluate TTS output, or need to train custom perceptual metrics. Skip it if you need a fully open commercial license or a polished, documented API.