atilika/kuromoji

Japanese tokenization without the tears

A self-contained Java morphological analyzer that ships its own dictionaries so you don't have to wrestle with MeCab.

Velocity (7d): +0.2 ★ / day · Trend: steady

What it does

Kuromoji segments Japanese text into morphemes, tags parts of speech, reduces inflected verbs and adjectives to their dictionary forms, and extracts kanji readings, all from a single Maven dependency. No external dictionary files to hunt down, no native bindings to break. You instantiate a Tokenizer, call tokenize(), and get back a list of Token objects with surface forms, POS tags, and readings.
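To make "surface forms, POS tags, and readings" concrete, here is a minimal sketch of what one token's data looks like once you pull it apart. It does not call kuromoji, so it runs without the dependency. The `IpadicLine` class is hypothetical, and the sample line is hard-coded on the assumption that it follows the usual IPADIC feature order (four POS levels, conjugation type, conjugation form, base form, reading, pronunciation). With kuromoji-ipadic on the classpath, the same text would presumably come from a Token's `getSurface()` and `getAllFeatures()`.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative only: parses one "surface<TAB>features" line in the IPADIC
 * feature layout. This class is hypothetical and not part of kuromoji.
 */
final class IpadicLine {
    final String surface;
    final List<String> features;

    private IpadicLine(String surface, List<String> features) {
        this.surface = surface;
        this.features = features;
    }

    /** Splits the surface form from its comma-separated feature list. */
    static IpadicLine parse(String line) {
        String[] parts = line.split("\t", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected surface<TAB>features: " + line);
        }
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        return new IpadicLine(parts[0], Arrays.asList(parts[1].split(",", -1)));
    }

    /** Returns "*" (IPADIC's placeholder) when a field is absent, e.g. for unknown words. */
    private String field(int index) {
        return index < features.size() ? features.get(index) : "*";
    }

    String partOfSpeech() { return field(0); } // top-level POS tag
    String baseForm()     { return field(6); } // dictionary form (lemma)
    String reading()      { return field(7); } // katakana reading

    public static void main(String[] args) {
        // Assumed sample: the inflected verb stem 食べ and its IPADIC features.
        IpadicLine token = parse("食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ");
        System.out.println(token.surface + " -> " + token.baseForm()
                + " (" + token.partOfSpeech() + ", " + token.reading() + ")");
    }
}
```

The point of the sketch is the field layout: the surface form is what appeared in the text, while the base form and reading are looked up from the dictionary. Other dictionaries such as UniDic use different feature sets, so the indices here should not be assumed to carry over.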

The interesting bit

The project bundles six different dictionaries (IPADIC, UniDic, JUMANDIC, NAIST jdic, plus NEologd and kana-accent variants) as separate Maven artifacts. That sounds like overkill until you realize Japanese NLP people argue about dictionary coverage the way other developers argue about tabs versus spaces. The README even admits it can’t tell you which to pick (“a boring answer”) and gently nudges newcomers toward kuromoji-ipadic.

Key highlights

  • Ships dictionaries inside the JAR; zero external configuration
  • Supports word segmentation, POS tagging, lemmatization, and kanji readings out of the box
  • Six dictionary flavors with distinct Maven coordinates and Token classes
  • Build profiles for benchmarking against Japanese Wikipedia (~765 MB download)
  • Apache 2.0 licensed, including third-party dictionary data

Caveats

  • Two NEologd variants are listed as “will be available from Maven Central in a future version”; it is unclear whether that ever happened
  • The kuromoji-unidic-neologd package name contains a typo (kanaaneologdcent) in the README, suggesting limited maintenance attention
  • Last release appears to be 0.9.0; snapshot builds show 1.0-SNAPSHOT

Verdict

Worth a look if you’re building search or NLP in Java and need Japanese tokenization without native dependencies. Skip it if you’re already invested in MeCab or need actively maintained bindings for Python/Go/Rust.
