pemistahl/lingua

A language detector that actually works on tweets

JVM language detection libraries traditionally choke on short text; Lingua was built specifically to handle single words and fragments without sacrificing accuracy on longer documents.

810 stars · Kotlin · Data Tooling
Velocity (7d): +0.3 ★ / day · Trend: steady

What it does

Lingua identifies which of 75 languages a given text is written in. It runs entirely offline, needs no configuration, and targets both long documents and very short snippets like single words or Twitter messages. The author built it as a lighter alternative to dragging in entire ML frameworks just to detect language.

The interesting bit

Most JVM language detectors fall apart on short text and get worse as you add more languages to the pool. Lingua combines rule-based and statistical methods without using word dictionaries, and the README includes extensive benchmark plots comparing it against Apache Tika, OpenNLP, and Optimaize across single words, word pairs, and sentences. The “quality over quantity” approach (75 well-supported languages) avoids the trap of claiming 200 languages and being wrong half the time.
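To make "rule-based plus statistical, no word dictionaries" concrete, here is a minimal sketch in Java. It is not Lingua's implementation and does not use Lingua's API. The class name `TinyDetector`, the two toy training sentences, the language codes, and the smoothing constant are all assumptions chosen for illustration. The rule step checks for a script that only one candidate language uses, and the statistical step scores character trigrams with add-one smoothing.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy detector: illustrates the general idea only, not Lingua's code.
class TinyDetector {
    // Arbitrary smoothing constant (an assumption), added to each denominator.
    private static final double SMOOTHING = 1000.0;

    // language code -> (character trigram -> count)
    private final Map<String, Map<String, Integer>> models = new HashMap<>();
    // language code -> total number of trigrams seen in training
    private final Map<String, Integer> totals = new HashMap<>();

    // Lowercase and pad with spaces so word boundaries become part of the trigrams.
    private static String normalize(String s) {
        return " " + s.toLowerCase(Locale.ROOT) + " ";
    }

    void train(String language, String sample) {
        Map<String, Integer> counts = models.computeIfAbsent(language, k -> new HashMap<>());
        String text = normalize(sample);
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
        }
    }

    String detect(String input) {
        // Rule step: Armenian script is unique among the candidates, so decide immediately.
        boolean hasArmenian = input.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARMENIAN);
        if (hasArmenian) {
            return "hy";
        }

        // Statistical step: pick the model with the highest summed log-probability.
        String text = normalize(input);
        String best = "unknown";
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> entry : models.entrySet()) {
            double total = totals.getOrDefault(entry.getKey(), 0);
            double score = 0.0;
            for (int i = 0; i + 3 <= text.length(); i++) {
                int count = entry.getValue().getOrDefault(text.substring(i, i + 3), 0);
                score += Math.log((count + 1.0) / (total + SMOOTHING));
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    // Builds a detector from two toy sentences; real models need far more data.
    static TinyDetector demo() {
        TinyDetector d = new TinyDetector();
        d.train("en", "the quick brown fox jumps over the lazy dog and then the dog sleeps");
        d.train("de", "der schnelle braune fuchs springt auf den faulen hund und dann ruht der hund");
        return d;
    }

    public static void main(String[] args) {
        TinyDetector d = demo();
        System.out.println(d.detect("the dog"));
        System.out.println(d.detect("der hund"));
        // Armenian letters written as Unicode escapes to keep the source ASCII-only.
        System.out.println(d.detect("\u0562\u0561\u0580\u0587"));
    }
}
```

Even this toy version separates two-word inputs because character trigrams carry signal below the word level, which is the property that matters for short text. A real detector would train on large corpora, as the README describes, and would need far more careful handling of closely related languages.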

Key highlights

  • Works offline with no external API calls
  • Supports 75 languages including several not covered by competing JVM libraries (Armenian, Azerbaijani, etc.)
  • Two accuracy modes: “high” and “low” (tradeoff unclear from README, but both are benchmarked)
  • Trained on one-million-sentence Leipzig University Wortschatz corpora; tested on held-out 10k-sentence sets
  • Kotlin library with Java interop, published to Maven Central

Caveats

  • The README is upfront that adding more languages may dilute accuracy, so the 75-language set is intentionally conservative
  • No explicit performance/latency numbers or memory footprint guidance provided
  • “Low accuracy mode” is mentioned but not explained, so when to use it is left to the reader

Verdict

Worth a look if you’re on the JVM and need language detection without the weight of a full NLP pipeline. Skip it if you need real-time streaming at massive scale or coverage beyond the supported 75 languages; the author is clear that expansion happens carefully, not quickly.
