alphacep/vosk

Speech recognition by brute-force memory: a 100,000-hour database of audio chunks

VOSK skips neural network training in favor of storing every audio chunk it has ever seen, then fingerprint-matches new input against the hoard.

What it does

VOSK is a speech recognition system that sidesteps conventional neural network training. Instead of learning abstract patterns, it slices audio into chunks, hashes them with locality-sensitive hashing (LSH), and stores them in a massive indexed database. Decoding becomes a lookup problem: match incoming audio chunks against the hoard, look up which phones the matching stored chunks were aligned to, and use those labels to guide recognition decisions.
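To make the lookup idea concrete, here is a minimal sketch of an LSH-backed chunk index. It is not VOSK's code: the `ChunkIndex` class, its method names, the 16-bit signature, and the choice of random-hyperplane (sign) hashing are all assumptions made for illustration, and the feature vectors stand in for whatever audio features the real system hashes.

```python
import random
from collections import Counter, defaultdict


class ChunkIndex:
    """Toy LSH index (hypothetical): hashed feature chunks -> phone label counts."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = random.Random(seed)
        # One random hyperplane per signature bit (sign-random-projection LSH).
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = defaultdict(Counter)

    def signature(self, chunk):
        """Pack the sign of each projection into an integer bucket key."""
        bits = 0
        for plane in self.planes:
            dot = sum(p * x for p, x in zip(plane, chunk))
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits

    def add(self, chunk, phone):
        """Store one labelled chunk; no model parameters are updated."""
        self.buckets[self.signature(chunk)][phone] += 1

    def lookup(self, chunk):
        """Return the phone counts stored in the query chunk's bucket."""
        counts = self.buckets.get(self.signature(chunk))
        return dict(counts) if counts else {}


index = ChunkIndex(dim=4)
chunk = [1.0, 0.2, -0.5, 0.3]   # made-up feature vector
index.add(chunk, "AH")
index.add(chunk, "AH")
votes = index.lookup(chunk)     # phone counts for the matching bucket
```

Similar vectors tend to share a signature, so a lookup touches one bucket instead of scanning the whole database. A real system would likely use several hash tables and a distance check on the candidates; both are left out here.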

The interesting bit

The core bet is that memorization beats generalization, or at least matches it, given how brittle neural ASR can be on unseen conditions. The README is admirably frank about the cost: the index “is really huge, it is not expected to fit a memory of single server.” The payoff is lifelong learning by simple accretion: add more audio to the database, recognition improves, no retraining required.
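Learning by accretion can be sketched in a few lines. The store layout, the function names `add_sample` and `best_phone`, and the integer bucket key below are hypothetical; the point is only that a correction is an insert into the database rather than a training run.

```python
from collections import Counter, defaultdict

# Hypothetical store: hashed chunk key -> counts of phone labels seen for it.
store = defaultdict(Counter)


def add_sample(key, phone, weight=1):
    """Accrete one labelled sample; nothing is retrained."""
    store[key][phone] += weight


def best_phone(key):
    """Return the most frequently stored phone for this key, or None."""
    counts = store.get(key)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


add_sample(0b1011, "IH")             # one (possibly wrong) historical label
before = best_phone(0b1011)
add_sample(0b1011, "IY", weight=2)   # a reviewer adds verified samples
after = best_phone(0b1011)
```

Because every decision traces back to stored samples, a wrong answer can be inspected and outvoted by adding corrected data, which is the human-auditable property the project emphasizes.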

Key highlights

  • Trains on 100,000 hours of speech on “very simple hardware” (their claim, not benchmarked)
  • Supports correction by direct sample addition—no model retraining
  • Parallelizes across thousands of nodes
  • Currently requires Kaldi for initial audio segmentation and phone alignment
  • Python tooling for indexing (index.py) and verification (verify.py) against the database
  • Explicitly designed to complement neural methods, not replace them entirely
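
One way an index like this could be spread over many nodes is to route each bucket key to a shard and query only the shards that own the keys in a request. The sketch below simulates that in a single process. The modulo routing, the `N_SHARDS` value, and the function names are assumptions for illustration, not a description of how VOSK distributes its database.

```python
from collections import Counter, defaultdict

N_SHARDS = 4  # stand-in for the "thousands of nodes" mentioned above

# Each shard holds its own slice of the index: key -> phone counts.
shards = [defaultdict(Counter) for _ in range(N_SHARDS)]


def shard_for(key):
    """Route a bucket key to the shard that owns it (simple modulo routing)."""
    return key % N_SHARDS


def add(key, phone):
    shards[shard_for(key)][key][phone] += 1


def query(keys):
    """Group keys by owning shard, then collect each shard's answers."""
    by_shard = defaultdict(list)
    for key in keys:
        by_shard[shard_for(key)].append(key)
    results = {}
    for shard_id, shard_keys in by_shard.items():
        # In a real deployment this loop body would be one network call per shard.
        for key in shard_keys:
            results[key] = dict(shards[shard_id].get(key, {}))
    return results


add(5, "AH")
add(9, "S")
add(6, "T")
answers = query([5, 6, 7])   # key 7 was never stored
```

Since each key lives on exactly one shard, inserts and lookups for different keys do not contend with each other, which is what makes this kind of workload easy to parallelize.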

Caveats

  • The index is too large to fit in a single server’s memory, so single-server deployment is not realistic
  • Segmentation still depends on conventional ASR (Kaldi), so it’s not fully self-contained
  • Generalization capabilities are “quite questionable”—the authors’ own words
  • README describes future work (multilingual training, mobile model reduction, custom segmentation) with no timeline or evidence of progress

Verdict

Worth a look if you’re researching lifelong learning, retrieval-based ASR, or need a system where human-auditable correction is more important than compact model size. Skip it if you need production speech recognition today—this is experimental infrastructure with Kaldi as a hard dependency and no stated accuracy benchmarks against standard test sets.
