patrickschur/language-detection

PHP language detection without calling Google Translate

A self-contained n-gram library that ships with pre-trained models for 110 languages and runs entirely offline.

855 stars · PHP · Data Tooling · +0.2 ★/day over the last 7 days (trend: steady)

What it does

Feed it a string of text, get back ranked language guesses with confidence scores. It ships with pre-trained models for 110 languages and a Trainer class to roll your own—whether that’s Klingon, spam vs. ham, or something more practical.

The interesting bit

The library compiles n-gram frequency data into plain PHP arrays rather than JSON (since v4), which is a blunt-force but effective way to dodge parse overhead. You can also cap the n-gram count to trade accuracy for speed, or whitelist specific languages to skip comparisons you don’t need.
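To make the accuracy-versus-speed trade-off concrete, here is a minimal, self-contained sketch of one common n-gram technique: ranked profiles compared with an "out-of-place" distance. It is an illustration under assumptions, not the library's code. The function names (ngramProfile, outOfPlace, detectLanguage), the parameter $maxNgrams, and the toy training sentences are all hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Build a ranked n-gram profile: most frequent n-grams first.
 * $maxNgrams caps the profile size (hypothetical parameter name).
 */
function ngramProfile(string $text, int $maxNgrams = 300, int $maxN = 3): array
{
    $counts = [];
    // Split on anything that is not a letter; requires the mbstring extension.
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $padded = '_' . $word . '_'; // mark word boundaries
        $len = mb_strlen($padded);
        for ($n = 1; $n <= $maxN; $n++) {
            for ($i = 0; $i + $n <= $len; $i++) {
                $gram = mb_substr($padded, $i, $n);
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // highest count first
    return array_slice(array_keys($counts), 0, $maxNgrams);
}

/** Sum of rank differences; n-grams missing from the model get a fixed penalty. */
function outOfPlace(array $sample, array $model): int
{
    $rankInModel = array_flip($model);
    $penalty = count($model);
    $distance = 0;
    foreach ($sample as $rank => $gram) {
        $distance += isset($rankInModel[$gram])
            ? abs($rank - $rankInModel[$gram])
            : $penalty;
    }
    return $distance;
}

/** Returns language => distance, closest language first. */
function detectLanguage(string $text, array $trainingTexts, int $maxNgrams = 300): array
{
    $sample = ngramProfile($text, $maxNgrams);
    $scores = [];
    foreach ($trainingTexts as $lang => $corpus) {
        $scores[$lang] = outOfPlace($sample, ngramProfile($corpus, $maxNgrams));
    }
    asort($scores); // smallest distance first
    return $scores;
}

// Toy training data, far too small for real use.
$training = [
    'en' => 'the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun',
    'nl' => 'de snelle bruine vos springt over de luie hond en dan slaapt de hond in de zon',
];

$ranked = detectLanguage('the dog is in the house', $training);
echo array_key_first($ranked), PHP_EOL;
```

In this sketch, a smaller $maxNgrams means shorter profiles and fewer comparisons per language, at the cost of dropping rarer n-grams that may be the ones that separate similar languages. Restricting $trainingTexts to a handful of languages skips the remaining comparisons entirely.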

Key highlights

  • 110 built-in languages, with trainable support for custom ones
  • Method chaining: detect()->blacklist('de')->limit(0, 3)->close() (limit takes an offset and a length)
  • ArrayAccess lets you pluck scores like $result['nl']
  • Custom tokenizers via TokenizerInterface for domain-specific text
  • Requires PHP ≥ 7.4 and the mbstring extension
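For readers unfamiliar with the pattern, the chaining and ArrayAccess highlights can be sketched with a small stand-in class. ToyResult below is hypothetical and is not the library's result class. Its method names follow the highlights above, but the signatures and the choice to return a new object from each call are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for a detection result; not the library's class. */
final class ToyResult implements ArrayAccess
{
    /** @var array<string, float> language code => score, best first */
    private array $scores;

    public function __construct(array $scores)
    {
        arsort($scores); // highest score first
        $this->scores = $scores;
    }

    /** Drop the given languages; returns a new object so calls can be chained. */
    public function blacklist(string ...$langs): self
    {
        return new self(array_diff_key($this->scores, array_flip($langs)));
    }

    /** Keep only the given languages. */
    public function whitelist(string ...$langs): self
    {
        return new self(array_intersect_key($this->scores, array_flip($langs)));
    }

    /** Keep $length entries starting at $offset (assumed signature). */
    public function limit(int $offset, ?int $length = null): self
    {
        return new self(array_slice($this->scores, $offset, $length, true));
    }

    /** End the chain and hand back a plain array. */
    public function close(): array
    {
        return $this->scores;
    }

    // ArrayAccess: lets callers write $result['nl'].
    public function offsetExists($offset): bool
    {
        return isset($this->scores[$offset]);
    }

    public function offsetGet($offset): ?float
    {
        return $this->scores[$offset] ?? null;
    }

    public function offsetSet($offset, $value): void
    {
        throw new LogicException('ToyResult is read-only');
    }

    public function offsetUnset($offset): void
    {
        throw new LogicException('ToyResult is read-only');
    }
}

// Made-up scores, for illustration only.
$result = new ToyResult(['nl' => 0.66, 'af' => 0.51, 'de' => 0.40, 'en' => 0.38]);

$top = $result->blacklist('de')->limit(0, 2)->close(); // keeps 'nl' and 'af'
$dutch = $result['nl'];                                // 0.66
```

Returning a fresh object from each filter keeps the original result untouched, so the same detection can be sliced several ways without re-running it.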

Caveats

  • Needs “some sentences” for reliable detection; short strings are dicey
  • Training with large n-gram counts (the README suggests ~9,000 for better accuracy) is slow, though detection speed stays flat
  • Upgrading from v3 means regenerating any custom training files, because the format changed from JSON to PHP

Verdict

Worth a look if you’re building a PHP app that needs offline language detection without pulling in heavy ML dependencies. Skip it if you’re already running Python or need real-time detection on single words.
