Japanese tokenization that doesn't make you compile MeCab
A Cython wrapper that turns a finicky C++ tokenizer into a pip-installable Python library with sensible defaults.

What it does
fugashi wraps MeCab, the venerable C++ Japanese morphological analyzer, in a Cython layer so you can pip install it on most platforms without touching a compiler. It ships with wheels for Linux, macOS (Intel), and Windows 64-bit, and bundles named-tuple access to UniDic’s rich feature data — lemmas, part-of-speech tags, and more — directly from Python.
The interesting bit
The author wrote this after finding that existing MeCab Python bindings were “hard to use and lack English documentation.” The fix: expose MeCab’s output as Python objects with .feature.lemma and .pos attributes, plus provide two curated dictionary packages — a 2013 “lite” version for quick starts, and the full 770MB modern UniDic for serious work.
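As a rough illustration of that object-style access, here is a minimal sketch. It assumes fugashi and a UniDic dictionary (for example unidic-lite) are installed; the helper name tokenize and the sample sentence are made up for this example, and the output depends on the dictionary version, so none is shown.

```python
def tokenize(text):
    """Return (surface, lemma, pos) tuples for each token in `text`.

    `tokenize` is a hypothetical helper written for this example,
    not part of fugashi. It assumes a UniDic dictionary is installed,
    e.g. via `pip install 'fugashi[unidic-lite]'`.
    """
    # Imported lazily so this file can still be loaded without fugashi.
    from fugashi import Tagger

    tagger = Tagger()  # Tagger assumes UniDic-style features
    # Calling the tagger on a string yields node objects, one per token.
    return [(word.surface, word.feature.lemma, word.pos) for word in tagger(text)]


if __name__ == "__main__":
    try:
        for surface, lemma, pos in tokenize("麩菓子は、麩を主材料とした日本の菓子。"):
            print(surface, lemma, pos, sep="\t")
    except (ImportError, RuntimeError) as exc:
        # ImportError: fugashi missing; RuntimeError: no usable dictionary found.
        print("fugashi or a dictionary is not available:", exc)
```

Each row pairs the surface form with its dictionary lemma and a comma-joined part-of-speech string, which is usually enough for lemmatization or POS filtering.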
Key highlights
- pip install fugashi[unidic-lite] gets you tokenizing in seconds
- Tagger assumes UniDic; GenericTagger works with arbitrary dictionaries via field-number access
- create_feature_wrapper() lets you build named-tuple interfaces for custom dictionaries
- Published at NLP-OSS 2020 with a proper academic citation
- Interactive Streamlit demo at fugashi.streamlit.app
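To make the difference between field-number access and a named-tuple interface concrete, the sketch below uses only the standard library. It is not fugashi's implementation: make_feature_wrapper, parse_features, and the three-field layout (pos, reading, lemma) are hypothetical names chosen for illustration, and the naive comma split ignores quoted fields that real dictionaries may contain.

```python
from collections import namedtuple


def make_feature_wrapper(name, fields, default=None):
    """Hypothetical stand-in for the idea behind a feature wrapper:
    a namedtuple type whose missing trailing fields fall back to `default`."""
    field_names = fields.split()
    return namedtuple(name, field_names, defaults=(default,) * len(field_names))


def parse_features(raw, wrapper):
    """Split a MeCab-style comma-separated feature string into `wrapper`.

    Simplification: a plain split, with extra fields dropped and
    missing ones filled by the wrapper's defaults.
    """
    values = raw.split(",")
    return wrapper(*values[: len(wrapper._fields)])


# Assumed layout for an imaginary custom dictionary.
CustomFeatures = make_feature_wrapper("CustomFeatures", "pos reading lemma")

raw = "名詞,ネコ,猫"
by_index = raw.split(",")                      # field-number access
by_name = parse_features(raw, CustomFeatures)  # named access
short = parse_features("記号", CustomFeatures)  # fewer fields than expected

print(by_index[2], by_name.lemma, short.lemma)
```

Named access keeps calling code readable when a dictionary has dozens of fields, and defaults cover entries, such as unknown words, that carry fewer fields than the layout expects.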
Caveats
- No wheels for musl/Alpine Linux, PowerPC, or 32-bit Windows (build from source required)
- Full UniDic needs a separate python -m unidic download step and ~770MB disk
- Apple Silicon users: status unclear from README (Intel wheels mentioned explicitly)
Verdict
Worth a look if you’re doing Japanese NLP in Python and want MeCab’s accuracy without its deployment headaches. If you need Korean tokenization, the README nudges you toward pymecab-ko; if you refuse to install any C++ dependency at all, it nudges you toward SudachiPy instead.