Korean NLP without the training-data treadmill
A pure-Python toolkit that extracts words, tokenizes, and tags parts of speech from raw Korean text—no labeled corpora required.

What it does
soynlp is a Korean NLP library built on unsupervised statistical methods. Feed it a pile of homogeneous documents—movie reviews, a day’s news, whatever—and it learns word boundaries, extracts nouns, tokenizes sentences, and tags parts of speech using cohesion scores, branching entropy, and accessor variety rather than pre-trained models.
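To make those statistics concrete, here is a minimal from-scratch sketch of forward cohesion, right branching entropy, and right accessor variety on a toy corpus. It is not soynlp's code: the function names, the `END` marker, and the `max_len` cutoff are assumptions made for illustration, and the cohesion formula used is the commonly cited one, (count(word) / count(first character)) ** (1 / (len(word) - 1)).

```python
import math
from collections import Counter, defaultdict

END = "$"  # assumed end-of-unit marker, used only by this sketch


def count_prefixes(eojeols, max_len=6):
    """Count each prefix of every space-delimited unit and the character after it."""
    prefix_count = Counter()
    followers = defaultdict(Counter)
    for eojeol in eojeols:
        for end in range(1, min(len(eojeol), max_len) + 1):
            prefix = eojeol[:end]
            prefix_count[prefix] += 1
            nxt = eojeol[end] if end < len(eojeol) else END
            followers[prefix][nxt] += 1
    return prefix_count, followers


def cohesion_forward(prefix_count, word):
    """How strongly the characters of `word` stick together, read left to right."""
    if len(word) < 2 or prefix_count[word] == 0:
        return 0.0
    ratio = prefix_count[word] / prefix_count[word[0]]
    return ratio ** (1 / (len(word) - 1))


def right_branching_entropy(followers, word):
    """Entropy of the next-character distribution; high values hint at a word boundary."""
    counts = followers.get(word, Counter())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def right_accessor_variety(followers, word):
    """Number of distinct characters observed right after `word`."""
    return len(followers.get(word, ()))


# Toy corpus: one name with three different particles, plus two distractors.
eojeols = ["아이오아이는", "아이오아이가", "아이오아이의", "아이폰을", "아이들이"]
prefix_count, followers = count_prefixes(eojeols)

for candidate in ["아이오", "아이오아이", "아이오아이는"]:
    print(
        candidate,
        round(cohesion_forward(prefix_count, candidate), 3),
        round(right_branching_entropy(followers, candidate), 3),
        right_accessor_variety(followers, candidate),
    )
```

On this toy input the bare name ‘아이오아이’ scores higher on cohesion than the same string with a particle attached, and it has the most varied set of following characters, which is the signal that a word ends there.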
The interesting bit
The noun extractor v2 automatically decomposes compound nouns such as ‘잠수함발사탄도미사일’ (submarine-launched ballistic missile) into (‘잠수함’, ‘발사’, ‘탄도미사일’), i.e. submarine, launch, ballistic missile. It also exposes the L-R (left-right) graph structure so you can inspect which particles tend to attach to specific words. It’s the kind of linguistic plumbing most libraries hide.
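The L-R graph idea can be sketched in a few lines: cut every space-delimited unit at each position and count which right-hand parts follow each left-hand part. The `build_lr_graph` helper and its `max_l_len` parameter below are hypothetical names for illustration; soynlp's own data structure and accessors may look different.

```python
from collections import Counter, defaultdict


def build_lr_graph(eojeols, max_l_len=6):
    """Map each candidate L part to a Counter of the R parts seen after it."""
    lr_graph = defaultdict(Counter)
    for eojeol in eojeols:
        for cut in range(1, min(len(eojeol), max_l_len) + 1):
            left, right = eojeol[:cut], eojeol[cut:]
            # An empty R means the unit appeared as L on its own.
            lr_graph[left][right] += 1
    return lr_graph


# Toy input: two nouns seen with a handful of particles.
eojeols = ["미사일을", "미사일이", "미사일은", "미사일", "발사를", "발사가"]
lr_graph = build_lr_graph(eojeols)

# Inspect what attaches to a given L part.
for right, count in lr_graph["미사일"].most_common():
    print(repr(right), count)
```

A left part that is followed mostly by known particles (‘을’, ‘이’, ‘은’) is a plausible noun, which is roughly the intuition an L-R noun extractor builds on.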
Key highlights
- Three noun extractors (v1, News, v2) with v2 recommended; v2 fixes accuracy and compound-noun recognition issues in earlier versions
- WordExtractor scores candidates via forward/backward cohesion, left/right branching entropy, and accessor variety
- LTokenizer splits Korean phrases on “L + R” boundaries (e.g., noun + particle) using learned word scores
- Also includes MaxScoreTokenizer, RegexTokenizer, a normalizer, a PMI calculator, and a vectorizer
- Pure Python, depends only on numpy, scipy, scikit-learn, and psutil
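The “L + R” split described above can be approximated as follows: for each space-delimited unit, try every cut point and keep the one whose left part has the highest word score. This is a hedged sketch, not the library's LTokenizer; `l_tokenize`, the tie-breaking rule (prefer the longer L), and the hand-written `scores` dictionary are all assumptions made for illustration.

```python
def l_tokenize(sentence, scores, default=0.0):
    """Split each space-delimited unit into [L, R] at the best-scoring L prefix."""
    tokens = []
    for eojeol in sentence.split():
        # Score every possible L prefix; on ties, prefer the longer prefix,
        # so a unit with no known prefix stays in one piece.
        best_cut = max(
            range(1, len(eojeol) + 1),
            key=lambda cut: (scores.get(eojeol[:cut], default), cut),
        )
        tokens.append(eojeol[:best_cut])
        if best_cut < len(eojeol):
            tokens.append(eojeol[best_cut:])
    return tokens


# Hypothetical word scores; in practice these would be learned from a corpus.
scores = {"데이터": 0.9, "데이": 0.4, "분석": 0.8}

print(l_tokenize("데이터를 분석한다", scores))
```

With these scores, ‘데이터를’ splits into noun and particle because ‘데이터’ out-scores the shorter ‘데이’, while a unit with no scored prefix is returned whole.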
Caveats
- Requires homogeneous document sets; mixing domains (news + social media) degrades extraction quality
- Targets Python 3.5+; Python 2.x is untested, so stick with 3.x
- Parameter naming changed in 0.0.47 (min/max standardization), so older code may need updates
- Noun extractors are still in development and will eventually merge into a single class
Verdict
Worth a look if you’re working with Korean text and don’t have (or don’t want to curate) labeled training data. Skip it if you need battle-tested, production-grade morphological analysis with mature POS tagging—this is research-flavored tooling with visible rough edges.