polm/fugashi

Japanese tokenization that doesn't make you compile MeCab

A Cython wrapper that turns a finicky C++ tokenizer into a pip-installable Python library with sensible defaults.

523 stars · C++ · Data Tooling · 7-day velocity: +0.2 ★/day · Trend: steady

What it does

fugashi wraps MeCab, the venerable C++ Japanese morphological analyzer, in a Cython layer so you can pip install it on most platforms without touching a compiler. It ships with wheels for Linux, macOS (Intel), and Windows 64-bit, and bundles named-tuple access to UniDic’s rich feature data — lemmas, part-of-speech tags, and more — directly from Python.

The interesting bit

The author wrote this after finding that existing MeCab Python bindings were “hard to use and lack English documentation.” The fix: expose MeCab’s output as Python objects with .feature.lemma and .pos attributes, plus provide two curated dictionary packages — a 2013 “lite” version for quick starts, and the full 770MB modern UniDic for serious work.
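As a rough illustration of the attribute access described above, here is a minimal sketch. It assumes fugashi and a UniDic package (for example unidic-lite) are installed; the helper name lemma_table and its tagger parameter are hypothetical and exist only for this example, not as part of fugashi.

```python
def lemma_table(text, tagger=None):
    """Return (surface, lemma, pos) triples for each token in `text`.

    `tagger` is any callable that maps a string to a list of token
    objects; by default a fugashi Tagger is created.
    """
    if tagger is None:
        # Imported lazily so the helper can also be exercised with a
        # stand-in tagger when fugashi is not installed.
        from fugashi import Tagger
        tagger = Tagger()
    # .surface is the token text, .feature.lemma the UniDic lemma
    # (which may be None for unknown words), .pos the part-of-speech tag.
    return [(word.surface, word.feature.lemma, word.pos) for word in tagger(text)]
```

With a dictionary installed, calling lemma_table("麩菓子は、麩を主材料とした日本の菓子。") should return one triple per token. The exact segmentation depends on which UniDic version is in use.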

Key highlights

  • pip install fugashi[unidic-lite] gets you tokenizing in seconds
  • Tagger assumes UniDic; GenericTagger works with arbitrary dictionaries via field-number access
  • create_feature_wrapper() lets you build named-tuple interfaces for custom dictionaries
  • Published at NLP-OSS 2020 with a proper academic citation
  • Interactive Streamlit demo at fugashi.streamlit.app
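
The GenericTagger and create_feature_wrapper() highlights can be sketched as follows. This is an illustration under assumptions, not the project's canonical usage: the helper names, the mecab_args parameter, and the field names alpha, beta, and gamma are hypothetical, and a real dictionary would need its own field list and MeCab arguments.

```python
def first_fields(text, tagger=None, mecab_args=""):
    """Return (surface, first feature field) pairs using field-number access."""
    if tagger is None:
        # Lazy import; assumes fugashi is installed and MeCab can locate a
        # dictionary, either by default or through `mecab_args`.
        from fugashi import GenericTagger
        tagger = GenericTagger(mecab_args)
    # With GenericTagger, features are addressed by position, not by name.
    return [(word.surface, word.feature[0]) for word in tagger(text)]


def make_named_tagger(field_names=("alpha", "beta", "gamma"), mecab_args=""):
    """Build a GenericTagger whose features are reachable by attribute name.

    `field_names` are placeholders; they must match the column order of the
    dictionary actually in use.
    """
    from fugashi import GenericTagger, create_feature_wrapper
    wrapper = create_feature_wrapper("CustomFeatures", list(field_names))
    return GenericTagger(mecab_args, wrapper=wrapper)
```

With the placeholder names above, tokens from make_named_tagger() would expose word.feature.alpha in place of word.feature[0].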

Caveats

  • No wheels for musl/Alpine Linux, PowerPC, or 32-bit Windows (build from source required)
  • Full UniDic needs a separate python -m unidic download step and ~770MB disk
  • Apple Silicon users: status unclear from README (Intel wheels mentioned explicitly)

Verdict

Worth a look if you’re doing Japanese NLP in Python and want MeCab’s accuracy without its deployment headaches. If you need Korean tokenization, the README nudges you toward pymecab-ko; if you refuse to install any C++ dependency at all, it points to SudachiPy instead.
