A language detector that actually reads your tweets
A Rust library that identifies 75 languages in text ranging from single words to long documents, with no neural networks or API calls required.

What it does
Lingua-rs tells you what language a piece of text is written in. It handles 75 languages, works offline, and needs no configuration. The author built it because existing Rust detectors choke on short snippets like tweets or single words, and get worse the more languages you ask them to consider.
The interesting bit
It mixes rule-based logic with statistical Naive Bayes methods, but deliberately skips neural networks, word dictionaries, and external APIs. The author trained models on one million sentences per language from Leipzig University’s Wortschatz corpora, then tested on held-out website data. The README includes extensive box plots comparing Lingua against CLD2, Whatlang, and Whichlang. It also gives a careful like-for-like breakdown: Whichlang supports only 16 languages, which flatters its accuracy numbers until the comparison is restricted to a common set of languages.
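To make the "statistical Naive Bayes" part concrete, here is a minimal sketch of how a character n-gram Naive Bayes classifier can pick a language. It is not Lingua's implementation: the trigram size, the add-one smoothing, the two tiny training corpora, and every name in it (`Model`, `train`, `detect`, `toy_models`) are assumptions chosen for illustration, and it uses only the Rust standard library.

```rust
use std::collections::{HashMap, HashSet};

/// Character trigram counts for one language (toy model, hypothetical type).
struct Model {
    name: &'static str,
    counts: HashMap<String, u32>,
    total: u32,
}

/// Split lowercased text into overlapping character trigrams.
fn trigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    chars.windows(3).map(|w| w.iter().collect::<String>()).collect()
}

/// Count trigrams over a small training corpus.
fn train(name: &'static str, corpus: &[&str]) -> Model {
    let mut counts = HashMap::new();
    let mut total = 0;
    for sentence in corpus {
        for gram in trigrams(sentence) {
            *counts.entry(gram).or_insert(0) += 1;
            total += 1;
        }
    }
    Model { name, counts, total }
}

/// Sum of log P(trigram | language), with add-one smoothing so that
/// unseen trigrams do not drive the probability to zero.
fn log_score(model: &Model, text: &str, vocab_size: usize) -> f64 {
    trigrams(text)
        .iter()
        .map(|gram| {
            let count = *model.counts.get(gram).unwrap_or(&0) as f64;
            ((count + 1.0) / (model.total as f64 + vocab_size as f64)).ln()
        })
        .sum()
}

/// Return the language with the highest score, assuming a uniform prior.
/// Returns None when the input is too short to contain a single trigram.
fn detect(models: &[Model], text: &str) -> Option<&'static str> {
    if trigrams(text).is_empty() {
        return None;
    }
    // Vocabulary size = number of distinct trigrams seen across all models.
    let vocab: HashSet<&String> = models.iter().flat_map(|m| m.counts.keys()).collect();
    models
        .iter()
        .map(|m| (m.name, log_score(m, text, vocab.len())))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(name, _)| name)
}

/// Two tiny, made-up corpora; a real model needs far more data.
fn toy_models() -> Vec<Model> {
    vec![
        train("english", &[
            "the quick brown fox jumps over the lazy dog",
            "languages are awesome and the weather is nice",
        ]),
        train("german", &[
            "der schnelle braune fuchs springt über den faulen hund",
            "sprachen sind großartig und das wetter ist schön",
        ]),
    ]
}

fn main() {
    let models = toy_models();
    println!("{:?}", detect(&models, "the weather"));
    println!("{:?}", detect(&models, "das wetter ist"));
}
```

The sketch also hints at why short input is hard: a single word yields only a handful of trigrams, so there is little evidence to separate similar languages. That is the gap the rule-based layer and the large training corpora described above are meant to close.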
Key highlights
- 75 languages supported, from Afrikaans to Zulu, with a stated “quality over quantity” expansion policy
- Handles single words, word pairs, and full sentences; tested on 1000 samples per category per language
- Pure offline operation — no network calls, no model downloads at runtime
- Benchmarked against three competing Rust libraries with published accuracy plots and tables
- Apache 2.0 licensed and available on crates.io; Python bindings are also maintained
Caveats
- The benchmark table is truncated mid-cell in the README; multiple-thread numbers for Lingua are cut off
- Speed is explicitly the trade-off: “Whichlang has the shortest processing time, Lingua the longest”
- 75 languages is fewer than some competitors support, which matters if you need broader coverage
Verdict
Grab this if you need reliable language detection on messy, short, or mixed input and can spare some CPU cycles. Skip it if raw throughput matters more than accuracy, or if you need languages outside the supported set.