Trigrams, not transformers: a 70-language detector in pure Rust
A lightweight language identification library that skips neural networks for n-grams and still lands in production search engines.

What it does
Whatlang takes a string of text and returns the detected language, script (Latin, Cyrillic, etc.), and a confidence score with a reliability flag. It covers 70 languages, runs entirely in Rust, and exposes a single detect() call that returns structured info you can assert against in tests.
The interesting bit
The library bets on trigram frequency models (Cavnar and Trenkle, ’94 vintage) rather than the neural network behind Google’s CLD3. Its reliability heuristic is a hyperbola over two inputs, the number of unique trigrams in the text and the gap between the top two language scores, which is both transparent and slightly retro in the best way.
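To make the idea concrete, here is a minimal, std-only sketch in the spirit of Cavnar and Trenkle's rank-order method, followed by a hyperbola-shaped reliability check. It is not Whatlang's source: the function names, the tie-breaking rule, and the two curve constants are assumptions chosen for illustration.

```rust
use std::collections::HashMap;

type Trigram = [char; 3];

// Hypothetical curve constants (placeholders, not confirmed library values).
const CURVE_SCALE: f64 = 3.0;
const CURVE_FLOOR: f64 = 0.015;

/// Count character trigrams after lowercasing and collapsing
/// every non-alphabetic run into a single space.
fn trigram_counts(text: &str) -> HashMap<Trigram, usize> {
    let mut chars: Vec<char> = vec![' '];
    for c in text.chars().flat_map(|c| c.to_lowercase()) {
        let c = if c.is_alphabetic() { c } else { ' ' };
        if c == ' ' && chars.last() == Some(&' ') {
            continue; // skip repeated separators
        }
        chars.push(c);
    }
    if chars.last() != Some(&' ') {
        chars.push(' ');
    }
    let mut counts = HashMap::new();
    for w in chars.windows(3) {
        *counts.entry([w[0], w[1], w[2]]).or_insert(0) += 1;
    }
    counts
}

/// Trigrams ordered by descending frequency (ties broken by code point
/// order so the result is deterministic), truncated to `max_len`.
fn ranked_profile(text: &str, max_len: usize) -> Vec<Trigram> {
    let mut items: Vec<(Trigram, usize)> = trigram_counts(text).into_iter().collect();
    items.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    items.into_iter().take(max_len).map(|(t, _)| t).collect()
}

/// "Out-of-place" distance: sum of rank differences, with a fixed
/// penalty for trigrams missing from the language profile.
/// Lower means a closer match.
fn out_of_place(doc: &[Trigram], lang: &[Trigram]) -> usize {
    let miss_penalty = lang.len();
    doc.iter()
        .enumerate()
        .map(|(i, t)| match lang.iter().position(|l| l == t) {
            Some(j) => if i > j { i - j } else { j - i },
            None => miss_penalty,
        })
        .sum()
}

/// Reliable when the relative gap between the two best candidates lies
/// above a hyperbola in the unique-trigram count: short texts need a
/// large gap, long texts only a small one.
fn is_reliable_sketch(unique_trigrams: usize, relative_gap: f64) -> bool {
    if unique_trigrams == 0 {
        return false;
    }
    let threshold = CURVE_SCALE / unique_trigrams as f64 + CURVE_FLOOR;
    relative_gap > threshold
}
```

A caller would build one ranked profile per language from sample text, pick the language with the smallest distance to the input's profile, and pass the unique-trigram count plus the relative gap between the two best candidates to the reliability check. The appeal of the hyperbola is that the rule stays inspectable: you can read off exactly how much margin a ten-trigram input needs compared with a three-hundred-trigram one.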
Key highlights
- 100% Rust, no C++ bindings or external dependencies to wrangle
- Recognizes script separately from language (useful for mixed or ambiguous text)
- is_reliable() flag based on a tunable threshold, not just raw confidence
- Used by Meilisearch and Sonic as a downstream dependency
- Ports exist for Go, Python, Ruby, Elixir, and C via FFI
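Script detection can be separated from language detection because a script is largely a property of code point ranges, while a language needs a statistical model. The sketch below shows that idea with a handful of Unicode blocks. The enum, the function names, and the block boundaries are illustrative assumptions; it covers only four scripts and is not Whatlang's implementation.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScriptSketch {
    Latin,
    Cyrillic,
    Greek,
    Arabic,
}

/// Map a single character to a script using coarse Unicode block ranges.
/// Digits, punctuation, and symbols carry no script information here.
fn script_of(c: char) -> Option<ScriptSketch> {
    if !c.is_alphabetic() {
        return None;
    }
    match c {
        'A'..='Z' | 'a'..='z' | '\u{00C0}'..='\u{024F}' => Some(ScriptSketch::Latin),
        '\u{0370}'..='\u{03FF}' => Some(ScriptSketch::Greek),
        '\u{0400}'..='\u{04FF}' => Some(ScriptSketch::Cyrillic),
        '\u{0600}'..='\u{06FF}' => Some(ScriptSketch::Arabic),
        _ => None,
    }
}

/// Return the script with the most letters in `text`, or `None` when no
/// letter from a known script is present. Ties go to the earlier entry
/// in `ORDER`.
fn dominant_script(text: &str) -> Option<ScriptSketch> {
    const ORDER: [ScriptSketch; 4] = [
        ScriptSketch::Latin,
        ScriptSketch::Cyrillic,
        ScriptSketch::Greek,
        ScriptSketch::Arabic,
    ];
    let mut counts = [0usize; 4];
    for c in text.chars() {
        if let Some(s) = script_of(c) {
            if let Some(idx) = ORDER.iter().position(|o| *o == s) {
                counts[idx] += 1;
            }
        }
    }
    let mut best: Option<(ScriptSketch, usize)> = None;
    for (s, n) in ORDER.iter().zip(counts.iter()) {
        let beats_current = match best {
            Some((_, b)) => *n > b,
            None => true,
        };
        if *n > 0 && beats_current {
            best = Some((*s, *n));
        }
    }
    best.map(|(s, _)| s)
}
```

Knowing the script first narrows the candidate languages considerably: Cyrillic text never needs to be scored against Latin-script profiles, and for mixed input the per-script counts show which alphabet dominates.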
Caveats
- 70 languages vs. 83–107 for CLD2/CLD3; coverage gaps may matter for some use cases
- No HTML parsing support (CLD2 handles this; Whatlang does not)
- UTF-8 only; unclear behavior on legacy encodings
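On the encoding caveat: a Rust &str is always valid UTF-8, so bytes in a legacy encoding have to be decoded before any detector sees them. The sketch below validates input as UTF-8 and, purely as an assumption for illustration, falls back to treating invalid input as ISO-8859-1 (Latin-1), whose bytes map one-to-one onto the first 256 code points. The helper name is hypothetical, and real inputs in other encodings would need a proper decoding library.

```rust
/// Turn raw bytes into a `String` that is safe to hand to a text detector.
/// Valid UTF-8 is kept as-is; anything else is decoded as ISO-8859-1,
/// which is an assumption about the input, not a general-purpose rule.
fn to_detectable_text(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_owned(),
        // Each Latin-1 byte is the code point of the same value.
        Err(_) => bytes.iter().map(|&b| b as char).collect(),
    }
}
```

Guessing wrong here produces mojibake, and mojibake skews trigram statistics, so it is worth settling the encoding before judging the detector's accuracy.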
Verdict
Good fit if you need fast, embeddable language detection in a Rust stack and can live with 70 languages. Skip it if you need HTML-aware parsing or the last dozen obscure languages—and don’t expect transformer-grade accuracy on very short or noisy inputs.