NLP for the other 6.5 billion people
A Python toolkit that treats 165-language tokenization as table stakes, not a stretch goal.

What it does
Polyglot is a Python NLP pipeline built for breadth over depth. It detects 196 languages, tokenizes 165 of them, and offers sentiment analysis (136 languages), word embeddings (137), and morphological analysis (135), plus transliteration for a smaller set of 69. The API is deliberately simple: wrap a string in Text() or Word() and read attributes such as .pos_tags, .entities, or .polarity.
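To make the wrap-and-read pattern concrete, here is a minimal sketch. It assumes polyglot and its native dependencies (PyICU, pycld2) are installed and that the models for your language were downloaded first, for example with `polyglot download embeddings2.en pos2.en sentiment2.en`. The function name `summarize` is hypothetical; only Text and its attributes come from Polyglot.

```python
# Sketch only: `summarize` is a hypothetical helper, not part of Polyglot.
# It assumes the POS and sentiment models for the detected language are installed.

def summarize(raw, hint_language_code=None):
    """Collect a few Polyglot annotations for one string into a plain dict."""
    # Imported lazily so this module still loads on machines without polyglot.
    from polyglot.text import Text

    text = Text(raw, hint_language_code=hint_language_code)
    return {
        "language": text.language.code,          # detected language code
        "words": [str(w) for w in text.words],   # tokenized words
        "pos_tags": list(text.pos_tags),         # (word, tag) pairs, 16 languages only
        "polarity": text.polarity,               # aggregate sentiment score
    }
```

Calling `summarize("The movie was surprisingly good.")` should return a dict with one entry per attribute, provided the English models are present. Passing `hint_language_code` is optional and only nudges language detection on short inputs.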
The interesting bit
The project inverts the usual NLP hierarchy. Instead of polishing English first, it gives most supported languages tokenization and embeddings at minimum, while POS tagging reaches only 16 languages. The README’s German NER example prints raw I-LOC and I-PER tags with escaped Unicode: no polish, just proof it works. That’s the aesthetic: coverage first, refinement later.
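If you want the polish the README skips, the raw tags are easy to tidy up yourself. The sketch below assumes Polyglot's three entity tags are I-LOC, I-ORG, and I-PER (I-ORG is my addition; the README example discussed here only shows the other two) and that the German models were fetched with something like `polyglot download embeddings2.de ner2.de`. `TAG_NAMES`, `readable_tag`, and `readable_entities` are hypothetical names, not Polyglot API.

```python
# Hypothetical helpers; only Text and .entities come from Polyglot itself.
TAG_NAMES = {"I-LOC": "location", "I-ORG": "organization", "I-PER": "person"}


def readable_tag(tag):
    """Map a raw entity tag to a friendlier label; unknown tags pass through."""
    return TAG_NAMES.get(tag, tag)


def readable_entities(raw, hint_language_code="de"):
    """Return (label, phrase) pairs for the entities Polyglot finds in `raw`."""
    # Lazy import: needs polyglot plus the downloaded NER and embedding models.
    from polyglot.text import Text

    text = Text(raw, hint_language_code=hint_language_code)
    # Each entity is a chunk of words carrying a .tag attribute.
    return [(readable_tag(chunk.tag), " ".join(chunk)) for chunk in text.entities]
```

The mapping is deliberately a plain dict so that any tag outside the assumed three falls through unchanged rather than raising.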
Key highlights
- 196-language detection, 165-language tokenization, 137-language word embeddings
- Single-object API: Text(string).words, Word(string, language="en").neighbors
- Morphological decomposition (“Preprocessing” → ['Pre', 'process', 'ing'])
- Cyrillic transliteration: English “preprocessing” becomes препрокессинг
- GPLv3 licensed, Travis CI + ReadTheDocs infrastructure
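The word-level highlights above can be exercised together. This is a rough sketch, assuming the English embedding and morphology models and the Russian transliteration model are installed (for example `polyglot download embeddings2.en morph2.en transliteration2.ru`). `word_features` is a hypothetical wrapper; Word, its .neighbors and .morphemes attributes, and Transliterator are the Polyglot pieces.

```python
# Sketch only: `word_features` is a hypothetical wrapper around Polyglot calls.

def word_features(word, language="en", target_lang="ru"):
    """Gather embedding neighbors, morphemes, and a transliteration for one word."""
    # Lazy imports: both modules need polyglot and its downloaded models.
    from polyglot.text import Word
    from polyglot.transliteration import Transliterator

    w = Word(word, language=language)
    transliterator = Transliterator(source_lang=language, target_lang=target_lang)
    return {
        "neighbors": list(w.neighbors),    # nearest words in embedding space
        "morphemes": list(w.morphemes),    # e.g. prefix, stem, suffix pieces
        "transliteration": transliterator.transliterate(word),
    }
```

With the models in place, `word_features("preprocessing")` is the call to compare against the README's reported morpheme split and Cyrillic output.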
Caveats
- POS tagging only covers 16 languages; coverage is uneven across features
- Last significant README activity appears pre-2017 (Travis CI badge, Python 2 u"" strings in examples)
- No images or screenshots provided in the repository
Verdict
Grab this if you’re prototyping multilingual pipelines and need broad language detection or transliteration without training custom models. Skip it if you need state-of-the-art accuracy on English-only tasks—spaCy or Stanza have overtaken it there.