bab2min/Kiwi

A Korean tokenizer that outruns its rivals and fixes your typos

Kiwi is a fast, open-source Korean morphological analyzer with built-in typo correction and bindings for nearly every language you might actually use.

729 stars · C++ · Data Tooling
Velocity (7d): +0.2 ★/day · Trend: steady

What it does

Kiwi segments Korean text into morphemes (nouns, verbs, particles, endings, and the rest) and labels them with the Sejong tag set. It claims ~87% accuracy on web text and ~94% on written text, and since version 0.13.0 it can auto-correct simple typos during analysis. The core is C++, but the project has accumulated wrappers for Python, Java, C#, Go, R, Rust, Flutter, WebAssembly, and even an Android AAR.
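For a feel of what that looks like from the Python side, here is a minimal sketch. It assumes the kiwipiepy binding is installed and that its `Kiwi(typos="basic")` constructor option, `tokenize()` method, and `form`/`tag` token fields behave as we understand them; check the binding's own documentation before relying on any of it. The helper `nouns_only` and the sample pairs are hypothetical, written by hand for illustration rather than taken from Kiwi's output.

```python
def analyze(text, correct_typos=False):
    """Return (form, tag) pairs for text via the kiwipiepy binding.

    Assumption: kiwipiepy is installed. It is imported lazily so the
    rest of this sketch works without it.
    """
    from kiwipiepy import Kiwi  # third-party: pip install kiwipiepy

    # typos="basic" turns on the typo-correction mode (0.13.0+);
    # None leaves it off.
    kiwi = Kiwi(typos="basic" if correct_typos else None)
    return [(tok.form, tok.tag) for tok in kiwi.tokenize(text)]


def nouns_only(pairs):
    """Keep morphemes whose Sejong tag starts with NN (NNG, NNP, NNB)."""
    return [form for form, tag in pairs if tag.startswith("NN")]


# Hand-written (form, tag) pairs for illustration, not actual Kiwi output.
sample = [("키위", "NNG"), ("를", "JKO"), ("먹", "VV"), ("었", "EP"), ("다", "EF")]
nouns = nouns_only(sample)
```

Filtering on the tag prefix is a common first step for keyword extraction or search indexing; in real use the pairs would come from `analyze(...)` instead of the hand-written sample.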

The interesting bit

The project ships its own lightweight language model for disambiguation, which is unusual for a “fast” tokenizer. The README shows benchmark charts suggesting it keeps pace with or outruns competitors while still resolving ambiguous splits. Multithreading is built into the library itself, not bolted on by wrappers.

Key highlights

  • Core library in C++17 with prebuilt binaries for Windows, Linux, macOS, Android, plus ARM64 and PPC64LE
  • Auto typo correction (0.13.0+) with eval data showing recovery on web_with_typos.txt
  • Sentence splitting and tokenization benchmarks published, with links to reproduce
  • Web demo at kiwi.bab2min.pe.kr for quick testing
  • Active CI across x86_64, ARM64, PPC64LE, and WASM

Caveats

  • Swift wrapper is “coming soon” as of the README
  • Model files live in Git LFS; clone without Git LFS installed and you get small pointer stubs in place of the actual models
  • The typo-correction mode loads more slowly and uses about 2.5× the memory (693 MB vs 278 MB in the sample run)
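The Git LFS caveat is easy to check for after a clone. A Git LFS pointer stub is a small text file whose first line is `version https://git-lfs.github.com/spec/v1`, so a rough sketch like the one below can flag files that were never fetched. The function names are ours, and the directory you scan is whatever path your checkout keeps its models in; neither comes from the Kiwi repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts like a Git LFS pointer stub."""
    with open(path, "rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(root):
    """List files under root that look like unfetched LFS pointers."""
    return sorted(
        str(p) for p in Path(root).rglob("*")
        if p.is_file() and is_lfs_pointer(p)
    )
```

If `find_lfs_pointers` returns anything for the model directory, installing Git LFS and running `git lfs pull` inside the checkout should replace the stubs with the real files.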

Verdict

Worth a look if you process Korean text at scale and need speed without sacrificing accuracy. Skip it if you only need English tokenization or if you are allergic to downloading large model files.
