lovit/soynlp

Korean NLP without the training-data treadmill

A pure-Python toolkit that extracts words, tokenizes, and tags parts of speech from raw Korean text—no labeled corpora required.

984 stars · Python · Data Tooling

What it does

soynlp is a Korean NLP library built on unsupervised statistical methods. Feed it a pile of homogeneous documents—movie reviews, a day’s news, whatever—and it learns word boundaries, extracts nouns, tokenizes sentences, and tags parts of speech using cohesion scores, branching entropy, and accessor variety rather than pre-trained models.
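
To make the three statistics concrete, here is a minimal sketch that computes them over whitespace-separated tokens. It is not soynlp's implementation: the function names, the `max_len` cutoff, and the choice to treat "the token ends here" as one possible follower are assumptions made for illustration. Forward cohesion is taken as the geometric mean of the character-to-character transition probabilities, which reduces to (count of the word as a prefix / count of its first character) raised to 1/(n-1).

```python
import math
from collections import Counter, defaultdict


def collect_stats(tokens, max_len=6):
    """Count every prefix of each token and the character that follows it."""
    prefix_counts = Counter()
    next_chars = defaultdict(Counter)
    for token in tokens:
        for end in range(1, min(len(token), max_len) + 1):
            prefix = token[:end]
            prefix_counts[prefix] += 1
            # An empty string marks "the token ends right after this prefix".
            next_chars[prefix][token[end:end + 1]] += 1
    return prefix_counts, next_chars


def forward_cohesion(word, prefix_counts):
    """Geometric mean of the transition probabilities inside `word`."""
    if len(word) < 2 or prefix_counts[word] == 0:
        return 0.0
    ratio = prefix_counts[word] / prefix_counts[word[0]]
    return ratio ** (1 / (len(word) - 1))


def right_branching_entropy(word, next_chars):
    """Shannon entropy (natural log) of the characters that follow `word`."""
    followers = next_chars[word]
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in followers.values())


def right_accessor_variety(word, next_chars):
    """Number of distinct followers observed immediately after `word`."""
    return len(next_chars[word])


# Toy corpus: the same noun followed by three different particles.
tokens = ["아이오아이는", "아이오아이의", "아이오아이가", "아이들은"]
prefix_counts, next_chars = collect_stats(tokens)
for candidate in ["아이오", "아이오아이", "아이오아이는"]:
    print(candidate,
          round(forward_cohesion(candidate, prefix_counts), 3),
          round(right_branching_entropy(candidate, next_chars), 3),
          right_accessor_variety(candidate, next_chars))
```

In this toy corpus the entropy and accessor variety jump at '아이오아이', where several particles can follow, and cohesion drops once a particle is included in the candidate. That is the signal an unsupervised extractor can use to place a word boundary.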

The interesting bit

Noun extractor v2 automatically decomposes compound nouns such as ‘잠수함발사탄도미사일’ into (‘잠수함’, ‘발사’, ‘탄도미사일’), and it exposes the L-R (left-right) graph structure so you can inspect which particles tend to attach to specific words. It’s the kind of linguistic plumbing most libraries hide.
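
The idea of an L-R graph can be sketched in a few lines: split every token at every position and count which right-hand parts follow each left-hand part. This is an illustrative approximation, not the data structure soynlp exposes; `build_lr_graph` and `max_l_len` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_lr_graph(tokens, max_l_len=8):
    """Split each token at every position and count the (L, R) pairs."""
    l_to_r = defaultdict(Counter)
    for token in tokens:
        for cut in range(1, min(len(token), max_l_len) + 1):
            left, right = token[:cut], token[cut:]
            # right is "" when the token consists of the L part alone.
            l_to_r[left][right] += 1
    return l_to_r


# Toy corpus: one noun with several particles, plus an unrelated token.
tokens = ["학교에", "학교에서", "학교는", "학교", "학생은"]
graph = build_lr_graph(tokens)

# Inspect what tends to attach to the candidate word '학교'.
for right, count in graph["학교"].most_common():
    print(repr(right), count)
```

A left part that is followed by many distinct, particle-like right parts is a plausible noun, which is why being able to inspect this structure is useful for debugging extraction results.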

Key highlights

  • Three noun extractors (v1, News, v2); v2 is recommended and fixes the accuracy and compound-noun recognition problems of the earlier versions
  • WordExtractor scores candidates via forward/backward cohesion, left/right branching entropy, and left/right accessor variety
  • LTokenizer splits Korean phrases on “L + R” boundaries (e.g., noun + particle) using learned word scores
  • Also includes MaxScoreTokenizer, RegexTokenizer, a normalizer, PMI calculator, and vectorizer
  • Pure Python; depends only on numpy, scipy, scikit-learn, and psutil
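
The “L + R” splitting described above can be approximated as follows: for each whitespace-separated token, pick the prefix with the highest word score as L and treat the remainder as R. This is a sketch of the idea rather than LTokenizer itself; the function name, the `default_score` parameter, the tie-breaking rule (prefer the longer prefix), and the hand-written scores are all assumptions. In practice the scores would come from a learned statistic such as cohesion.

```python
def l_tokenize(sentence, scores, default_score=0.0):
    """Split each whitespace token into (L, R), with L the best-scoring prefix."""
    result = []
    for token in sentence.split():
        # Every non-empty prefix is a candidate L; ties go to the longer prefix
        # because tuples compare by score first, then by cut position.
        candidates = [
            (scores.get(token[:cut], default_score), cut)
            for cut in range(1, len(token) + 1)
        ]
        _, best_cut = max(candidates)
        result.append((token[:best_cut], token[best_cut:]))
    return result


# Hypothetical word scores, e.g. cohesion values learned from a corpus.
scores = {"데이터": 0.9, "데이": 0.4, "분석": 0.8}
print(l_tokenize("데이터는 분석을 기다린다", scores))
# → [('데이터', '는'), ('분석', '을'), ('기다린다', '')]
```

A token with no scored prefix stays whole, with an empty R part, so unknown words pass through unsplit instead of being cut arbitrarily.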

Caveats

  • Requires homogeneous document sets; mixing domains (news + social media) degrades extraction quality
  • Targets Python 3.5 and later; Python 2.x is untested
  • Parameter naming changed in 0.0.47 (min/max standardization), so older code may need updates
  • Noun extractors are still in development and will eventually merge into a single class

Verdict

Worth a look if you’re working with Korean text and don’t have (or don’t want to curate) labeled training data. Skip it if you need battle-tested, production-grade morphological analysis with mature POS tagging—this is research-flavored tooling with visible rough edges.
