Transformer meets Mandarin speech: 12.8% CER, one neural net
A from-scratch PyTorch port of the Speech Transformer paper, wired for end-to-end Chinese ASR with Kaldi doing the feature grunt work.

What it does
Takes acoustic features and spits out Mandarin characters through a single Transformer network — no separate acoustic model, no language model bolted on the side. The repo wraps training, decoding, and even Visdom loss plotting into one shell script (run.sh) that stages through data prep, feature extraction, training, and decoding.
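Staged recipes of this kind usually let you restart from a later stage instead of redoing everything. Here is a minimal Python sketch of that pattern, not the repo's actual run.sh: the names run_pipeline, STAGES, and start_stage are hypothetical, and the real script may number or name its stages differently.

```python
from typing import Callable, Dict, List


def run_pipeline(stages: Dict[int, Callable[[], str]], start_stage: int = 0) -> List[str]:
    """Run every stage whose index is >= start_stage, in ascending order."""
    log = []
    for index in sorted(stages):
        if index < start_stage:
            continue  # skip work already finished in an earlier run
        log.append(stages[index]())
    return log


# Hypothetical placeholders for the four phases described above;
# each callable stands in for the real work and just returns its name.
STAGES = {
    0: lambda: "data prep",
    1: lambda: "feature extraction",
    2: lambda: "training",
    3: lambda: "decoding",
}

if __name__ == "__main__":
    # Resume from training, assuming features were extracted previously.
    print(run_pipeline(STAGES, start_stage=2))
```

The stage gate is what makes a long recipe tolerable in practice: feature extraction runs once, and later experiments can start from training or decoding.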
The interesting bit
The author didn’t just port the paper; they glued it to the Kaldi ecosystem for feature extraction while keeping the neural bits pure PyTorch. That hybrid approach (Kaldi for acoustic features, Transformer for everything else) was a pragmatic bridge during the 2018-2019 transition, when end-to-end ASR was still proving itself against pipeline systems.
Key highlights
- Single-network end-to-end: acoustic features → characters, no intermediate phoneme representation
- AIShell-1 recipe included: download the dataset, tweak one path, then run bash run.sh
- Training resumption and Visdom visualization baked into the runner
- CER of 12.8% on AIShell-1, slightly better than LAS (13.2%) but well behind LSTMP (9.85%)
- PyTorch 0.4.1+ era code — expect some archaeology if you’re on modern torch
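For readers new to the metric, character error rate is the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference characters. The sketch below computes it with a plain Levenshtein distance. It is an illustration only: the function names and the two sample sentences are made up, and the repo's own scoring may normalize text differently.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference character missing from hypothesis
                curr[j - 1] + 1,     # extra character in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total edit errors over total reference characters."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_errors / total_chars


if __name__ == "__main__":
    # Made-up examples: one substitution in the first pair, one deletion in the second.
    refs = ["今天天气很好", "我爱北京"]
    hyps = ["今天天汽很好", "我爱北"]
    print(f"CER = {cer(refs, hyps):.1%}")
```

Because Mandarin is scored per character rather than per word, no word segmentation is needed before scoring, which is one reason CER is the standard metric on AIShell-1.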
Caveats
- Kaldi dependency is mandatory, not optional — feature extraction is outsourced entirely
- The 12.8% CER lags behind the LSTMP baseline in the same table, so the “attention is all you need” sales pitch doesn’t quite close the deal on this dataset
- PyTorch 0.4.1+ requirement suggests significant bit-rot risk; no commits visible since ~2019
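To put the gap in perspective, it helps to look at relative rather than absolute differences. A quick sketch using only the three CER figures quoted above (the helper name relative_change is just for illustration):

```python
def relative_change(new: float, baseline: float) -> float:
    """Signed relative change of `new` with respect to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# CER figures (in percent) as quoted in this post.
transformer, las, lstmp = 12.8, 13.2, 9.85

vs_las = relative_change(transformer, las)      # negative means fewer errors than LAS
vs_lstmp = relative_change(transformer, lstmp)  # positive means more errors than LSTMP

if __name__ == "__main__":
    print(f"vs LAS: {vs_las:+.1f}%  vs LSTMP: {vs_lstmp:+.1f}%")
```

That works out to roughly 3% fewer errors than LAS but about 30% more than LSTMP, so the win over LAS is marginal while the deficit against LSTMP is substantial.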
Verdict
Worth a look if you’re studying how Transformer ASR was adapted for Mandarin or need a reference implementation of the Zhao et al. ICASSP 2019 paper. Skip it if you want production-ready tooling — the field has moved to Conformer, wav2vec 2.0, and friends.