edobashira/speech-language-processing

A 2,200-star map to the speech/NLP tooling wilderness

A curated list that catalogs finite-state transducers, language models, and speech recognizers so you don't have to hunt them down yourself.

Velocity (7d): +0.5 ★/day · Trend: steady

What it does

This repo is a manually maintained index of open-source tools and datasets across speech and natural language processing. Categories include finite-state toolkits, language modeling libraries, speech recognizers, signal processing utilities, text-to-speech systems, speech corpora, and machine translation frameworks. Each entry gets a one-line description and a link.

The interesting bit

The list leans hard into the classical, toolkit-heavy side of NLP (weighted finite-state transducers, hidden Markov models, n-gram smoothing), which makes it a useful time capsule of the pre-transformer toolchain. The maintainer’s asides (“my personal favourite LM toolkit”) and the occasional dead link give it the flavor of an actual researcher’s bookmarks folder rather than SEO content.

Key highlights

  • Covers niche tooling rarely aggregated elsewhere: OpenFst wrappers, WFST decoders, Pitman-Yor process libraries, segmental CRF toolkits
  • Lists speech corpora and lexical resources (LibriSpeech, TED-LIUM, CMUdict) alongside software
  • Entries span multiple decades and maintenance statuses, from actively developed (Kaldi) to explicitly unmaintained (MIT FST Toolkit)
  • Sub-categories are alphabetized, which helps browsing but doesn’t prioritize by relevance or freshness

Caveats

  • Several links point to defunct hosting (Google Code, raw .zip files on personal sites) with no archival fallback noted
  • No clear criteria for inclusion or deprecation; some descriptions are copied from project homepages without verification
  • Machine Translation section is truncated in the source, so coverage there is incomplete
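The dead-link caveat is easy to triage mechanically. Below is a minimal Python sketch, assuming the list uses standard Markdown `[name](url)` links. The host set, the archive suffixes, the sample entries, and the helper names `flag_links` and `wayback_lookup_url` are all hypothetical illustrations; none of them come from the repo or from this review. The sketch only inspects URL text and makes no network requests, so a flagged link may still be alive and an unflagged one may be dead.

```python
import re
from urllib.parse import urlparse

# Assumed heuristics for this sketch, not taken from the repository.
DEFUNCT_HOSTS = {"code.google.com", "googlecode.com"}
ARCHIVE_SUFFIXES = (".zip", ".tar.gz", ".tgz")

# Matches Markdown links of the form [name](http://...).
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def flag_links(markdown_text):
    """Return (name, url, reason) for each link that looks fragile."""
    flagged = []
    for name, url in LINK_RE.findall(markdown_text):
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        on_defunct_host = host in DEFUNCT_HOSTS or any(
            host.endswith("." + h) for h in DEFUNCT_HOSTS
        )
        if on_defunct_host:
            flagged.append((name, url, "defunct hosting"))
        elif parsed.path.lower().endswith(ARCHIVE_SUFFIXES):
            flagged.append((name, url, "raw archive"))
    return flagged


def wayback_lookup_url(url):
    """Build a Wayback Machine capture-listing URL to check by hand."""
    return "https://web.archive.org/web/*/" + url


# Hypothetical entries; the real list's entries are not reproduced here.
SAMPLE = """
- [ExampleFst](https://code.google.com/p/example-fst/) - hypothetical entry
- [ExampleLM](http://people.example.edu/~someone/lm.zip) - hypothetical entry
- [ExampleASR](https://github.com/example/asr) - hypothetical entry
"""

if __name__ == "__main__":
    for name, url, reason in flag_links(SAMPLE):
        print(f"{name}: {reason} -> try {wayback_lookup_url(url)}")
```

To use it on the real list, read the README text from a local copy and pass it to `flag_links`. Confirming that a link is actually dead would need an HTTP check on top of this.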

Verdict

Worth bookmarking if you’re maintaining legacy speech pipelines, researching historical NLP approaches, or looking for a starting point for comparing finite-state libraries. Skip it if you want modern neural-only stacks or actively curated, annotated guidance.
