pemistahl/lingua-rs

A language detector that actually reads your tweets

A Rust library that identifies 75 languages in anything from single words to long documents, with no neural networks or API calls required.

1.1k stars · Rust · Data Tooling
Velocity (7d): +0.5 ★/day · Trend: steady

What it does

Lingua-rs tells you what language a piece of text is written in. It handles 75 languages, works offline, and needs no configuration. The author built it because existing Rust detectors choke on short snippets like tweets or single words, and get worse the more languages you ask them to consider.

The interesting bit

It mixes rule-based logic with statistical Naive Bayes methods, but deliberately skips neural networks, word dictionaries, and external APIs. The author trained models on one million sentences per language from Leipzig University’s Wortschatz corpora, then tested on held-out website data. The README includes extensive box plots comparing Lingua against CLD2, Whatlang, and Whichlang — with a careful apples-to-apples breakdown showing that Whichlang only supports 16 languages, which flatters its accuracy numbers until you level the playing field.
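To make the "rules plus Naive Bayes" idea concrete, here is a minimal sketch using only the Rust standard library. It is not Lingua's code or API. The `Model` struct, the functions `trigrams`, `train`, `log_score`, `detect`, and `toy_models`, the three-sentence corpora, and the choice of character trigrams with add-one smoothing are all assumptions made for illustration. A script check (Cyrillic or not) plays the role of the rule-based filter, and a smoothed trigram log-likelihood picks among the remaining candidates.

```rust
use std::collections::{HashMap, HashSet};

/// Toy per-language model: raw counts of character trigrams (hypothetical type).
struct Model {
    name: &'static str,
    cyrillic: bool, // rule-based hint: the language is written in Cyrillic
    counts: HashMap<String, u32>,
    total: u32,
}

/// Lowercases the text, keeps letters and spaces, and returns its character trigrams.
fn trigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text
        .to_lowercase()
        .chars()
        .filter(|c| c.is_alphabetic() || *c == ' ')
        .collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

/// Counts trigrams over a tiny, made-up training corpus.
fn train(name: &'static str, cyrillic: bool, corpus: &[&str]) -> Model {
    let mut counts = HashMap::new();
    let mut total = 0;
    for sentence in corpus {
        for t in trigrams(sentence) {
            *counts.entry(t).or_insert(0) += 1;
            total += 1;
        }
    }
    Model { name, cyrillic, counts, total }
}

/// Naive Bayes log-likelihood of the trigrams, with add-one smoothing.
fn log_score(model: &Model, grams: &[String], vocab: usize) -> f64 {
    let denom = model.total as f64 + vocab as f64;
    grams
        .iter()
        .map(|g| {
            let c = *model.counts.get(g).unwrap_or(&0) as f64;
            ((c + 1.0) / denom).ln()
        })
        .sum()
}

/// Rule step: keep only languages whose script matches the input.
/// Statistical step: return the remaining language with the highest score.
fn detect(models: &[Model], text: &str) -> Option<&'static str> {
    let has_cyrillic = text.chars().any(|c| ('\u{0400}'..='\u{04FF}').contains(&c));
    let grams = trigrams(text);
    if grams.is_empty() {
        return None; // too short for trigrams in this toy version
    }
    // Shared vocabulary size so smoothing is comparable across languages.
    let vocab: HashSet<&String> = models.iter().flat_map(|m| m.counts.keys()).collect();
    models
        .iter()
        .filter(|m| m.cyrillic == has_cyrillic)
        .map(|m| (m.name, log_score(m, &grams, vocab.len())))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(name, _)| name)
}

/// Three illustrative models trained on a handful of sentences each.
fn toy_models() -> Vec<Model> {
    vec![
        train("English", false, &[
            "the quick brown fox jumps over the lazy dog",
            "this is the house where the children are playing",
            "what is the weather like today",
        ]),
        train("German", false, &[
            "der schnelle braune fuchs springt über den faulen hund",
            "das ist das haus in dem die kinder spielen",
            "wie ist das wetter heute",
        ]),
        train("Russian", true, &[
            "быстрая коричневая лиса прыгает через ленивую собаку",
            "какая сегодня погода",
        ]),
    ]
}
```

Calling `detect(&toy_models(), "das wetter ist heute schön")` runs the script filter first, so the Russian model is never scored for Latin-script input. A real detector trained on a million sentences per language would behave very differently on short or ambiguous text; the sketch only shows how the two stages fit together.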

Key highlights

  • 75 languages supported, from Afrikaans to Zulu, with a stated “quality over quantity” expansion policy
  • Handles single words, word pairs, and full sentences; tested on 1000 samples per category per language
  • Pure offline operation — no network calls, no model downloads at runtime
  • Benchmarked against three competing Rust libraries with published accuracy plots and tables
  • Apache 2.0 licensed, available on crates.io with Python bindings also maintained

Caveats

  • The benchmark table is truncated mid-cell in the README; multiple-thread numbers for Lingua are cut off
  • Speed is explicitly the trade-off: “Whichlang has the shortest processing time, Lingua the longest”
  • 75 languages is fewer than some competitors if you need broader coverage

Verdict

Grab this if you need reliable language detection on messy, short, or mixed input and can spare some CPU cycles. Skip it if raw throughput matters more than accuracy, or if you need languages outside the supported set.
