Is soynlp open source?

Yes — lovit/soynlp is an open-source project tracked on heatdrop.

What language is soynlp written in?

lovit/soynlp is primarily written in Python.

How popular is soynlp?

lovit/soynlp has 986 stars on GitHub.

Where can I find soynlp?

lovit/soynlp is on GitHub at https://github.com/lovit/soynlp.

lovit/soynlp

Korean NLP without the training-data treadmill

A pure-Python toolkit that extracts words, tokenizes, and tags parts of speech from raw Korean text—no labeled corpora required.

★ 986 stars, Python, Data Tooling

What it does

soynlp is a Korean NLP library built on unsupervised statistical methods. Feed it a set of homogeneous documents (movie reviews, a single day's news, and so on) and it learns word boundaries, extracts nouns, tokenizes sentences, and tags parts of speech. It does this from corpus statistics such as cohesion scores, branching entropy, and accessor variety rather than from pre-trained models.
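To make those statistics concrete, here is a minimal pure-Python sketch of a forward cohesion score, right branching entropy, and right accessor variety computed over a toy list of space-delimited tokens. It is an illustration under simplifying assumptions, not soynlp's implementation: the function names (`prefix_counts`, `cohesion_forward`, `right_branching`) and the toy corpus are hypothetical, and only tokens that start with the candidate word are considered.

```python
import math
from collections import Counter

def prefix_counts(eojeols, max_len=8):
    """Count every prefix (left-side substring) of every space-delimited token."""
    counts = Counter()
    for eojeol in eojeols:
        for end in range(1, min(len(eojeol), max_len) + 1):
            counts[eojeol[:end]] += 1
    return counts

def cohesion_forward(word, counts):
    """(count(c1..cn) / count(c1)) ** (1 / (n - 1)); 0.0 for unseen or 1-char input."""
    if len(word) < 2 or counts[word] == 0:
        return 0.0
    return (counts[word] / counts[word[0]]) ** (1 / (len(word) - 1))

def right_branching(word, eojeols):
    """Return (entropy, accessor variety) of the character that follows `word`."""
    followers = Counter(
        e[len(word)] for e in eojeols if e.startswith(word) and len(e) > len(word)
    )
    total = sum(followers.values())
    entropy = -sum((c / total) * math.log(c / total) for c in followers.values())
    return entropy, len(followers)

# Hypothetical toy corpus: one word followed by three different particles,
# plus two shorter tokens that share its first two characters.
eojeols = ["아이오아이는", "아이오아이의", "아이오아이가", "아이는", "아이돌"]
counts = prefix_counts(eojeols)

score = cohesion_forward("아이오아이", counts)               # (3 / 5) ** (1 / 4)
entropy, variety = right_branching("아이오아이", eojeols)    # three distinct followers
inner_entropy, inner_variety = right_branching("아이오아", eojeols)  # one follower only
```

A high cohesion score says the characters tend to occur together, while high branching entropy and accessor variety at the right edge say that many different continuations follow, which is the usual signal of a word boundary. Inside a word ("아이오아") the next character is fully predictable, so entropy drops to zero.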

The interesting bit

The noun extractor v2 automatically decomposes compound nouns such as ‘잠수함발사탄도미사일’ into (‘잠수함’, ‘발사’, ‘탄도미사일’). It also exposes the L-R (left-right) graph structure, so you can inspect which particles tend to attach to a given word. It’s the kind of linguistic plumbing most libraries hide.
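The two ideas in this paragraph can be sketched in a few lines of plain Python. The block below is a hypothetical illustration, not the code the v2 extractor runs: `build_lr_graph` counts every (left, right) split of each token, and `decompose` is a simple fewest-pieces dynamic program over a noun set that is assumed to be given. All names and the toy data are assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_lr_graph(eojeols, max_l_len=10):
    """Map each left part L to a Counter of the right parts R seen after it."""
    graph = defaultdict(Counter)
    for eojeol in eojeols:
        for i in range(1, min(len(eojeol), max_l_len) + 1):
            graph[eojeol[:i]][eojeol[i:]] += 1
    return graph

def top_r_parts(graph, l_part, topk=3):
    """Most frequent right parts attached to `l_part` (empty list if unseen)."""
    return graph.get(l_part, Counter()).most_common(topk)

def decompose(compound, nouns):
    """Split `compound` into the fewest known nouns; None if no full split exists."""
    best = [None] * (len(compound) + 1)  # best[i] = segmentation of compound[:i]
    best[0] = ()
    for end in range(1, len(compound) + 1):
        for start in range(end):
            piece = compound[start:end]
            if best[start] is not None and piece in nouns:
                candidate = best[start] + (piece,)
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1]

# Hypothetical toy tokens: "학교" followed by different particles.
graph = build_lr_graph(["학교에", "학교에서", "학교는", "학교에", "학생이"])
school_particles = top_r_parts(graph, "학교", topk=1)

# Assumed noun set; in practice it would come from a trained extractor.
nouns = {"잠수함", "발사", "탄도미사일", "탄도", "미사일"}
parts = decompose("잠수함발사탄도미사일", nouns)
```

Preferring the segmentation with the fewest pieces keeps ‘탄도미사일’ whole instead of splitting it further into ‘탄도’ and ‘미사일’. That is one plausible heuristic, and the library may score candidates differently.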

Key highlights

Three noun extractors (v1, News, v2) with v2 recommended; v2 fixes accuracy and compound-noun recognition issues in earlier versions
WordExtractor scores candidates via cohesion forward/backward, left/right branching entropy, and accessor variety
LTokenizer splits Korean phrases on “L + R” boundaries (e.g., noun + particle) using learned word scores
Also includes MaxScoreTokenizer, RegexTokenizer, a normalizer, PMI calculator, and vectorizer
Pure Python; depends only on numpy, scipy, scikit-learn, and psutil
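The "L + R" splitting idea from the list above can be illustrated with a short sketch: for each space-delimited token, pick the prefix with the highest word score as L and treat the remainder as R. This is a simplified stand-in, not the LTokenizer class itself. The function name `l_tokenize`, the tie-breaking rule, and the score values are assumptions chosen for the example.

```python
def l_tokenize(sentence, scores):
    """Split each space-delimited token into (L, R) at its best-scoring prefix.

    `scores` maps candidate words to a score such as cohesion. Ties, including
    tokens with no scored prefix at all, fall back to the longest prefix, so an
    unknown token stays whole with an empty R part.
    """
    tokens = []
    for eojeol in sentence.split():
        prefixes = [eojeol[:i] for i in range(1, len(eojeol) + 1)]
        left = max(prefixes, key=lambda p: (scores.get(p, 0.0), len(p)))
        tokens.append((left, eojeol[len(left):]))
    return tokens

# Hypothetical word scores; in practice they would be learned from a corpus.
scores = {"데이터": 0.9, "데이": 0.4, "분석": 0.8}
pairs = l_tokenize("데이터를 분석했다", scores)
```

Because the split is driven entirely by the score table, swapping in scores learned from a different domain changes the tokenization without touching the code.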

Caveats

Requires homogeneous document sets; mixing domains (news + social media) degrades extraction quality
Targets Python 3.5+; Python 2.x is not fully tested, so 3.x is strongly recommended
Parameter names were standardized in 0.0.47 (consistent min/max naming), so code written for earlier versions may need updates
Noun extractors are still in development and will eventually merge into a single class

Verdict

Worth a look if you’re working with Korean text and don’t have (or don’t want to curate) labeled training data. Skip it if you need battle-tested, production-grade morphological analysis with mature POS tagging—this is research-flavored tooling with visible rough edges.
