aboSamoor/polyglot

NLP for the other 6.5 billion people

A Python toolkit that treats 165-language tokenization as table stakes, not a stretch goal.

2.4k stars · Python · Language Models · ML Frameworks
Velocity (7d): +0.5 ★/day · Trend: steady

What it does

Polyglot is a Python NLP pipeline built for breadth over depth. It detects 196 languages, tokenizes 165 of them, and offers sentiment analysis, word embeddings, and morphological analysis for well over a hundred languages each, with transliteration for a smaller set. The API is deliberately simple: wrap a string in Text() or Word() and read attributes such as .pos_tags, .entities, or .polarity.
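To make the wrap-and-read pattern concrete, here is a minimal sketch. The helper names summarize and analyze are ours, not part of polyglot; the sketch assumes polyglot and its native dependencies (PyICU, pycld2) are installed, and it only touches language detection and tokenization.

```python
def summarize(text):
    """Pull a couple of fields off a polyglot-style Text object.

    Duck-typed: `text` only needs `.language.code` and `.words`,
    both of which polyglot.text.Text exposes.
    """
    return {
        "language": text.language.code,          # detected language code
        "tokens": [str(w) for w in text.words],  # tokens as plain strings
    }


def analyze(raw):
    """Wrap a raw string in polyglot's Text and summarize it."""
    # Imported lazily: polyglot depends on PyICU and pycld2, which
    # need native libraries and may not be installed everywhere.
    from polyglot.text import Text
    return summarize(Text(raw))
```

Calling analyze("Bonjour, Mesdames.") should return the detected language code together with the token list. Attributes such as .pos_tags, .entities, and .polarity follow the same pattern but additionally need per-language models fetched with the polyglot download command.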

The interesting bit

The project inverts the usual NLP hierarchy. Instead of polishing English first, it gives nearly every language tokenization and embeddings at minimum, while deeper features such as POS tagging reach only 16 languages. The README’s German NER example outputs raw I-LOC and I-PER tags with escaped Unicode: no polish, just proof it works. That’s the aesthetic: coverage first, refinement later.
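For readers who want something tidier than the raw tag dump, the sketch below groups entity chunks by tag. The names group_entities and entities_in are hypothetical helpers of ours; the sketch assumes each chunk returned by Text(...).entities carries a .tag attribute and iterates over its words, and that the relevant NER and embedding models have already been downloaded.

```python
from collections import defaultdict


def group_entities(chunks):
    """Map each tag (e.g. "I-LOC") to the entity strings carrying it.

    Each chunk is expected to expose `.tag` and to iterate over its
    words, which is how polyglot's entity chunks behave.
    """
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk.tag].append(" ".join(chunk))
    return dict(grouped)


def entities_in(raw):
    """Run polyglot NER on a raw string and group the result by tag."""
    # Lazy import; NER also needs the language's models downloaded
    # first (for German: embeddings2.de and ner2.de).
    from polyglot.text import Text
    return group_entities(Text(raw).entities)
```

The grouping step is deliberately independent of polyglot, so it can be reused on any sequence of tagged word lists.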

Key highlights

  • 196-language detection, 165-language tokenization, 137-language word embeddings
  • Single-object API: Text(string).words, Word(string, language="en").neighbors
  • Morphological decomposition (“Preprocessing” → ['Pre', 'process', 'ing'])
  • Cyrillic transliteration: English “preprocessing” becomes препрокессинг
  • GPLv3 licensed, Travis CI + ReadTheDocs infrastructure
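The morphology and transliteration highlights above can be sketched in a few lines. The wrapper names (morphemes_of, transliterate, is_clean_split) are ours; the sketch assumes polyglot is installed and that the morphology and transliteration models for the languages involved have been downloaded beforehand.

```python
def morphemes_of(word, language="en"):
    """Split a word into morphemes via polyglot's Word wrapper."""
    # Lazy import so the module loads even without polyglot installed.
    from polyglot.text import Word
    return list(Word(word, language=language).morphemes)


def transliterate(word, source="en", target="ru"):
    """Transliterate a word between scripts, e.g. Latin to Cyrillic."""
    from polyglot.transliteration import Transliterator
    engine = Transliterator(source_lang=source, target_lang=target)
    return engine.transliterate(word)


def is_clean_split(word, parts):
    """Check that the parts only split the word and never rewrite it."""
    return "".join(parts) == word
```

With the models in place, morphemes_of("Preprocessing") and transliterate("preprocessing") should reproduce the two examples in the list, and is_clean_split offers a cheap sanity check on any segmentation.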

Caveats

  • POS tagging only covers 16 languages; coverage is uneven across features
  • Last significant README activity appears pre-2017 (Travis CI badge, Python 2 u"" strings in examples)

Verdict

Grab this if you’re prototyping multilingual pipelines and need broad language detection or transliteration without training custom models. Skip it if you need state-of-the-art accuracy on English-only tasks—spaCy or Stanza have overtaken it there.
