greyblake/whatlang-rs

Trigrams, not transformers: a 70-language detector in pure Rust

A lightweight language identification library that skips neural networks for n-grams and still lands in production search engines.

1.1k stars Rust Other AI

What it does

Whatlang takes a string of text and returns the detected language, the script (Latin, Cyrillic, etc.), and a confidence score with a reliability flag. It covers 70 languages, is written entirely in Rust, and exposes a single detect() call whose structured result you can assert against in tests.

The interesting bit

The library bets on trigram frequency models, in the tradition of Cavnar and Trenkle (1994), rather than the neural networks that power Google’s CLD3. Its reliability heuristic is a hyperbola in the plane of “unique trigram count” versus “gap between the top two language scores”: results on one side of the curve count as reliable, the rest do not. That is both transparent and slightly retro in the best way.
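To make the trigram idea concrete, here is a minimal sketch of Cavnar-Trenkle style rank-order matching using only the Rust standard library. It is not Whatlang's implementation: the names `trigram_profile`, `out_of_place`, `closest`, and `PROFILE_LEN` are hypothetical, and the normalization rules and the miss penalty are assumptions chosen for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical cap on profile size; also used as the penalty for a missing trigram.
const PROFILE_LEN: usize = 300;

/// Build a ranked trigram profile: trigrams ordered from most to least frequent.
/// Assumed normalization: lowercase, letters only, other characters collapse to one space.
fn trigram_profile(text: &str, max_len: usize) -> Vec<String> {
    let mut chars: Vec<char> = vec![' '];
    for c in text.chars().flat_map(|c| c.to_lowercase()) {
        if c.is_alphabetic() {
            chars.push(c);
        } else if *chars.last().unwrap() != ' ' {
            chars.push(' ');
        }
    }
    if *chars.last().unwrap() != ' ' {
        chars.push(' ');
    }

    // Count every window of three characters.
    let mut counts: HashMap<String, usize> = HashMap::new();
    for w in chars.windows(3) {
        let key: String = w.iter().collect();
        *counts.entry(key).or_insert(0) += 1;
    }

    // Sort by descending count; break ties alphabetically so the result is deterministic.
    let mut ranked: Vec<(String, usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.into_iter().take(max_len).map(|(t, _)| t).collect()
}

/// "Out-of-place" distance: sum of rank differences between two profiles.
/// A trigram absent from the language profile costs a fixed PROFILE_LEN.
fn out_of_place(doc: &[String], lang: &[String]) -> usize {
    doc.iter()
        .enumerate()
        .map(|(i, t)| match lang.iter().position(|l| l == t) {
            Some(j) => i.abs_diff(j),
            None => PROFILE_LEN,
        })
        .sum()
}

/// Return the name of the language profile closest to the text, if any profile exists.
fn closest<'a>(text: &str, profiles: &'a [(&'a str, Vec<String>)]) -> Option<&'a str> {
    let doc = trigram_profile(text, PROFILE_LEN);
    profiles
        .iter()
        .min_by_key(|entry| out_of_place(&doc, &entry.1))
        .map(|entry| entry.0)
}
```

In use, you would build one profile per language from training text with `trigram_profile`, then pass the list to `closest`. A real detector trains on far larger corpora and narrows the candidate languages by script first; this sketch only shows the ranking step.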

Key highlights

  • 100% Rust, no C++ bindings or external dependencies to wrangle
  • Recognizes script separately from language (useful for mixed or ambiguous text)
  • is_reliable() flag based on a tunable threshold, not just raw confidence
  • Used by Meilisearch and Sonic as a downstream dependency
  • Ports exist for Go, Python, Ruby, Elixir, and C via FFI
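The reliability flag mentioned in the list above can be pictured with a small sketch. This is an illustration of a hyperbola-shaped boundary, not Whatlang's actual code: the `Boundary` struct, its parameters `a` and `b`, the function names, and the threshold passed to `is_reliable` are all hypothetical.

```rust
/// Hypothetical boundary curve: required_gap = a / unique_trigrams + b.
/// Fewer unique trigrams means a larger gap is required before the result is trusted.
struct Boundary {
    a: f64,
    b: f64,
}

/// Map (unique trigram count, top score, runner-up score) to a confidence in [0, 1].
/// Points on or above the hyperbola get 1.0; points below are scaled proportionally.
fn confidence(unique_trigrams: usize, top: f64, second: f64, curve: &Boundary) -> f64 {
    if top <= 0.0 || unique_trigrams == 0 {
        return 0.0; // nothing matched, so there is nothing to be confident about
    }
    if second <= 0.0 {
        return 1.0; // no competing language at all
    }
    // Relative gap between the best and second-best language scores.
    let gap = (top - second) / second;
    let required = curve.a / unique_trigrams as f64 + curve.b;
    (gap / required).clamp(0.0, 1.0)
}

/// The reliability flag is a threshold on confidence (threshold value is an assumption).
fn is_reliable(confidence: f64, threshold: f64) -> bool {
    confidence > threshold
}
```

With illustrative parameters `a = 3.0` and `b = 0.015`, a 10% relative gap is not enough for a text with only 10 unique trigrams, but it clears the curve comfortably at 300 unique trigrams. That is the practical meaning of the heuristic: short inputs need a much clearer winner.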

Caveats

  • 70 languages vs. 83 for CLD2 and 107 for CLD3; coverage gaps may matter for some use cases
  • No HTML parsing support (CLD2 handles this; Whatlang does not)
  • UTF-8 only: input is a Rust string, so text in legacy encodings has to be transcoded before detection
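For the encoding caveat, one way to prepare raw bytes before handing them to any UTF-8-only detector is sketched below with the standard library alone. The helper names are hypothetical, and falling back to ISO-8859-1 (Latin-1) is an assumption made purely for illustration; real data may need a proper encoding detector or a transcoding crate.

```rust
/// Decode ISO-8859-1 (Latin-1) bytes: each byte maps to the Unicode code point of the same value.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Produce a valid UTF-8 String from raw bytes.
/// Assumption: anything that is not valid UTF-8 is treated as Latin-1.
fn to_detector_input(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_owned(),
        Err(_) => latin1_to_string(bytes),
    }
}
```

The returned `String` can then be borrowed as `&str` and passed to the detector. Guessing the wrong legacy encoding will produce wrong characters and therefore wrong trigrams, so this step deserves as much care as the detection itself.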

Verdict

Good fit if you need fast, embeddable language detection in a Rust stack and can live with 70 languages. Skip it if you need HTML-aware parsing or coverage of the long tail of less common languages, and don’t expect transformer-grade accuracy on very short or noisy inputs.
