AdolfVonKleist/Phonetisaurus

Speech recognition's missing dictionary generator

Trains models that guess how words sound, because you can't ship a pronunciation dictionary for every proper noun the user will invent.

516 stars · Shell · Data Tooling · Other AI
Velocity (7d): +0.1 ★/day · Trend: steady

What it does

Phonetisaurus builds grapheme-to-phoneme (G2P) models: feed it a dictionary of known word-to-pronunciation mappings, and it learns to generate pronunciations for words it has never seen. The output is a weighted finite-state transducer (WFST) in OpenFst format, the same representation used by Kaldi and other speech toolkits. It ships as C++ binaries with optional Python 3 bindings for extracting scores, alignments, and raw lattices.
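
To make the input side concrete, here is a minimal sketch of reading a pronunciation lexicon into memory. The line layout (a word, whitespace, then space-separated phonemes, one pronunciation per line) is an assumption for illustration, and `parse_lexicon` is a hypothetical helper, not part of Phonetisaurus.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map each word to a list of pronunciations (each a list of phoneme symbols)."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed layout: first whitespace-delimited field is the word,
        # everything after it is the phoneme sequence.
        word, *phones = line.split()
        if phones:
            lexicon[word].append(phones)
    return dict(lexicon)


# Toy entries; a word may appear more than once with different pronunciations.
sample = [
    "test\tT EH S T",
    "read\tR IY D",
    "read\tR EH D",
]
lexicon = parse_lexicon(sample)
print(lexicon["read"])
```

Keeping a list of pronunciations per word matters because homographs such as "read" legitimately map to more than one phoneme sequence.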

The interesting bit

The project treats G2P as a joint n-gram modeling problem: each aligned grapheme-phoneme pair becomes a single token, an n-gram model is estimated over sequences of those tokens, and the result is compiled into an FST. This is the old-school, pre-neural approach: fast, compact, and interpretable, with a lineage tracing back to INTERSPEECH papers and the original Google Code era. The README still references git-lfs archives of those historical releases.
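
The joint n-gram idea can be sketched in a few lines. This toy version fuses each aligned grapheme chunk and phoneme chunk into one token, then estimates unsmoothed bigram probabilities over those tokens. The alignments are written by hand, the `}` and `|` separators are illustrative choices, and the function names are hypothetical; the real tool learns alignments from the lexicon and uses a proper smoothed n-gram model.

```python
from collections import Counter


def joint_tokens(alignment, sep="}", multi="|"):
    """Turn aligned (graphemes, phonemes) chunks into single joint symbols."""
    return [multi.join(g) + sep + multi.join(p) for g, p in alignment]


def bigram_probs(sequences):
    """Unsmoothed maximum-likelihood bigram estimates over joint tokens."""
    pair_counts, context_counts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + seq + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}


# Hand-written toy alignments (assumption): "test" and "text".
aligned = [
    [(("t",), ("T",)), (("e",), ("EH",)), (("s",), ("S",)), (("t",), ("T",))],
    [(("t",), ("T",)), (("e",), ("EH",)), (("x",), ("K", "S")), (("t",), ("T",))],
]
corpus = [joint_tokens(a) for a in aligned]
probs = bigram_probs(corpus)

print(corpus[1])
print(probs[("e}EH", "x}K|S")])
```

Because graphemes and phonemes travel together in one token, the model scores a spelling and its pronunciation jointly, which is what lets the trained model be read as a transducer from letters to sounds.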

Key highlights

  • End-to-end training pipeline: align lexicon, estimate n-gram model, convert to WFST
  • Wrapper scripts (phonetisaurus-train, phonetisaurus-apply) hide the OpenFst plumbing
  • Supports n-best output, probability mass filtering, and greedy decoding
  • Optional Python bindings expose per-multigram scores and alignments
  • Docker images available; tested build path for Ubuntu 20.04 + OpenFst 1.7.2
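
For a feel of what n-best output with probability mass filtering means, here is a rough sketch. It assumes each hypothesis carries a cost equal to its negative log probability, normalizes over the candidates it is given, and keeps the best ones until their combined probability reaches a threshold. `filter_by_mass` and its parameters are hypothetical names, and the actual filtering in Phonetisaurus may differ in detail.

```python
import math


def filter_by_mass(hypotheses, mass=0.85, nbest=None):
    """Keep the best hypotheses until their normalized probability reaches `mass`.

    hypotheses: list of (pronunciation, cost), cost = negative log probability.
    Returns a list of (pronunciation, normalized probability).
    """
    ranked = sorted(hypotheses, key=lambda h: h[1])  # lowest cost first
    if nbest is not None:
        ranked = ranked[:nbest]
    weights = [math.exp(-cost) for _, cost in ranked]
    total = sum(weights)
    kept, cumulative = [], 0.0
    for (pron, _), weight in zip(ranked, weights):
        prob = weight / total
        kept.append((pron, prob))
        cumulative += prob
        if cumulative >= mass:
            break
    return kept


# Made-up candidates for the word "read" with illustrative probabilities.
hyps = [
    ("R AY D", -math.log(0.1)),
    ("R IY D", -math.log(0.6)),
    ("R EH D", -math.log(0.3)),
]
kept = filter_by_mass(hyps, mass=0.85)
greedy = filter_by_mass(hyps, nbest=1)
print([pron for pron, _ in kept])
```

With these numbers the first two candidates already cover 0.9 of the mass, so the third is dropped. Passing `nbest=1` reduces this to greedy decoding, where only the single best pronunciation survives.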

Caveats

  • Requires manual OpenFst installation and LD_LIBRARY_PATH wrangling; not a pip install experience
  • Python bindings need pybindgen and a manual .so copy step that feels circa 2010
  • The phonetisaurus-g2prnn binary exists, but the README offers no usage details, so it is unclear whether RNN support is first-class or vestigial

Verdict

Worth a look if you’re maintaining a Kaldi-based ASR pipeline or need a lightweight, self-contained G2P module without dragging in PyTorch. Skip it if you want state-of-the-art neural G2P out of the box; this is the reliable sedan, not the self-driving car.
