A field guide to not getting lost in Korean NLP
A curated list of tools, datasets, and papers for processing Korean text, because agglutinative morphology doesn't solve itself.

What it does
This is a curated awesome-list that catalogs resources for Korean-language NLP: morphological analyzers, datasets like Sejong and NamuWiki dumps, papers, lectures, and community links. It covers Korean-specific analyzers (Hannanum, Kkma, Komoran, Mecab-ko), the KoNLPy Python wrapper around several of them, and language-agnostic packages that work on Korean text (fastText, gensim).
The interesting bit
The list explicitly separates “NLP of Korean text” from “NLP information written in Korean”, a useful distinction if you’re hunting for tools versus hunting for tutorials you can actually read. The maintainer also keeps a live collabedit link for casual contributions, which feels charmingly retro.
Key highlights
- Morpheme analyzers: 12+ options including Java stalwarts (Hannanum, Kkma), C++ workhorses (Mecab-ko), and newer entrants (Rouzeta, seunjeon)
- Datasets: Government corpora (Sejong, KAIST), web dumps (Wikipedia, NamuWiki), and sentiment-labeled data (Naver movie corpus)
- Bindings matter: KoNLPy wraps multiple Java analyzers for Python; kroman implements Hangul romanization in five programming languages
- Community links: Korean-language NLP conferences held since 1989, plus active Facebook groups (TensorFlow KR, AI Korea)
- Odd gems: A crowdsourced Korean profanity dictionary and a TextRank summarizer demo running on Heroku
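
For a feel of what the romanizers and analyzers above start from, here is a minimal sketch of splitting precomposed Hangul syllables into their jamo using the arithmetic layout of the Unicode Hangul Syllables block. It is not how kroman or any listed analyzer is implemented; the helper names `decompose_syllable` and `decompose` are hypothetical and exist only for illustration.

```python
# Sketch only: decompose precomposed Hangul syllables into jamo via Unicode arithmetic.
# `decompose_syllable` and `decompose` are hypothetical helpers, not part of kroman or KoNLPy.

HANGUL_BASE = 0xAC00  # code point of "가", the first precomposed syllable
HANGUL_LAST = 0xD7A3  # code point of "힣", the last precomposed syllable

LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"  # 21 medial vowels
# 28 finals; the empty string at index 0 means "no final consonant"
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def decompose_syllable(ch):
    """Return (lead, vowel, tail) for one Hangul syllable, or None otherwise."""
    code = ord(ch)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        return None
    index = code - HANGUL_BASE
    # Syllables are laid out lead-major: 21 * 28 = 588 syllables per lead.
    lead, rest = divmod(index, len(VOWELS) * len(TAILS))
    # Within one lead, there are 28 finals per vowel.
    vowel, tail = divmod(rest, len(TAILS))
    return LEADS[lead], VOWELS[vowel], TAILS[tail]


def decompose(text):
    """Decompose each character; non-Hangul characters pass through unchanged."""
    return [decompose_syllable(ch) or ch for ch in text]


if __name__ == "__main__":
    print(decompose("한글 NLP"))
```

Calling `decompose("한글")` yields the triples for ㅎ-ㅏ-ㄴ and ㄱ-ㅡ-ㄹ. A real romanizer still has to map jamo to Latin letters and handle sound-change rules across syllable boundaries, which is where the listed tools earn their keep.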
Caveats
- Several paper links are dead (marked with strikethrough), and the English papers section is empty
- Some tool links point to Korean-only pages or SourceForge projects that may be unmaintained
- The “collabedit” contribution method suggests the list may not see frequent structured updates
Verdict
Worth bookmarking if you’re doing Korean NLP and tired of re-discovering that Mecab-ko exists. Skip it if you need actively maintained, benchmarked comparisons: this is a directory, not a review site.