A 920-star Java tokenizer that takes Chinese text apart with surgical precision
Jcseg is a long-running Java NLP library that segments CJK and English text, extracts keywords and summaries, and plugs directly into Lucene, Solr, Elasticsearch, and OpenSearch.

What it does
Jcseg tokenizes Chinese, Japanese, Korean, and English text using the MMSEG algorithm with seven different segmentation modes—from a fast “simple” mode to an NLP mode that recognizes emails, phone numbers, and place names. It also extracts keywords, key phrases, key sentences, and auto-summaries using TextRank (plus BM25 for summaries). A built-in Jetty server exposes everything via HTTP with JSON output.
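To make the dictionary-driven idea concrete, here is a minimal sketch of forward maximum matching (FMM), the greedy strategy behind a fast "simple" mode. It is not Jcseg's code: the class name, method signature, and four-word toy dictionary are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy forward-maximum-matching segmenter, not Jcseg's implementation.
class ForwardMaxMatch {

    /** Greedily takes the longest dictionary word at each position, falling back to one character. */
    public static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical toy dictionary; a real lexicon holds hundreds of thousands of entries.
        Set<String> dict = Set.of("研究", "研究生", "生命", "起源");
        System.out.println(segment("研究生命起源", dict, 3)); // prints [研究生, 命, 起源]
    }
}
```

The greedy result here is the classic wrong split: the intended reading is 研究 / 生命 / 起源. MMSEG-style modes address this by comparing several candidate word chunks and applying ambiguity filters before committing to a word.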
The interesting bit
The library doesn’t just split text; it chases down edge cases that break most tokenizers. Chinese fractions like “三十分之二” become “2/30”, “B超” and “卡拉ok” stay intact as mixed-language tokens, and it can even split run-together English strings such as “openarkcompiler” into “open ark compiler” using the same MMSEG logic. The README includes a gloriously dense test paragraph covering Tom Cruise’s divorce, Sichuan mala tang, and C++ book titles, all tokenized with part-of-speech tags.
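The fraction case can be sketched in a few lines. The snippet below is a hypothetical normalizer, not taken from Jcseg: it assumes the pattern "denominator 分之 numerator" and handles only small numerals built from 零 to 九 plus 十, 百, and 千.

```java
// Illustrative only: a toy Chinese-fraction normalizer, not Jcseg's implementation.
class ChineseFraction {
    private static final String DIGITS = "零一二三四五六七八九";

    /** Parses small Chinese numerals such as 二, 十二, 三十, or 一百零五. */
    public static int parseNumber(String s) {
        int total = 0;
        int digit = 0;
        for (char c : s.toCharArray()) {
            int d = DIGITS.indexOf(c);
            if (d >= 0) {
                digit = d; // remember the digit until a unit (or the end) follows
                continue;
            }
            int unit;
            if (c == '十') unit = 10;
            else if (c == '百') unit = 100;
            else if (c == '千') unit = 1000;
            else throw new IllegalArgumentException("Unsupported numeral: " + c);
            // A bare unit such as the 十 in 十二 counts as one of that unit.
            total += (digit == 0 ? 1 : digit) * unit;
            digit = 0;
        }
        return total + digit;
    }

    /** Rewrites "X分之Y" as "Y/X" without reducing the fraction. */
    public static String toFraction(String s) {
        int i = s.indexOf("分之");
        if (i <= 0 || i + 2 >= s.length()) {
            throw new IllegalArgumentException("Not a fraction: " + s);
        }
        int denominator = parseNumber(s.substring(0, i));
        int numerator = parseNumber(s.substring(i + 2));
        return numerator + "/" + denominator;
    }

    public static void main(String[] args) {
        System.out.println(toFraction("三十分之二")); // prints 2/30
    }
}
```

Note that the denominator comes first in Chinese, so the two halves swap places in the output. A production tokenizer would also need to spot such spans inside running text, which this sketch does not attempt.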
Key highlights
- Seven segmentation modes including FMM, MMSEG with four ambiguity filters, n-gram, and a dedicated “search” mode for fine-grained retrieval
- Plugs into Lucene, Solr, Elasticsearch, and OpenSearch as custom analyzers/tokenizers
- Auto-recognizes Chinese names (~94% accuracy, ~98% with rules), emails, URLs, phone numbers, currencies, datetime, and custom entities via lexicon
- Supports traditional/simplified Chinese conversion, pinyin appending, and synonym matching from 《现代汉语词典》 and CC-CEDICT
- Hot-reloads dictionaries via a daemon thread watching for lex-autoload.todo changes
- Standalone HTTP server with REST API for polyglot access
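The hot-reload highlight follows a common pattern: a background daemon thread polls a marker file and reloads whatever it names. The sketch below shows that pattern in general terms, not Jcseg's actual code. It assumes the watched file lists lexicon file names one per line and is emptied once they are processed. The class name, the callback, and the poll interval are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: a generic poll-and-reload loop, not Jcseg's implementation.
class LexiconAutoloader {
    private final Path todoFile;
    private final Consumer<String> reloader; // hypothetical callback that reloads one lexicon file

    public LexiconAutoloader(Path todoFile, Consumer<String> reloader) {
        this.todoFile = todoFile;
        this.reloader = reloader;
    }

    /** Hands every non-blank line to the reloader, then truncates the todo file. */
    public List<String> pollOnce() throws IOException {
        List<String> reloaded = new ArrayList<>();
        if (!Files.exists(todoFile)) {
            return reloaded;
        }
        for (String line : Files.readAllLines(todoFile, StandardCharsets.UTF_8)) {
            String name = line.trim();
            if (name.isEmpty()) {
                continue;
            }
            reloader.accept(name);
            reloaded.add(name);
        }
        if (!reloaded.isEmpty()) {
            Files.write(todoFile, new byte[0]); // empty the file so entries are not reloaded twice
        }
        return reloaded;
    }

    /** Starts a daemon thread that polls until interrupted or an I/O error occurs. */
    public Thread startDaemon(long pollMillis) {
        Thread t = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    pollOnce();
                    Thread.sleep(pollMillis);
                }
            } catch (InterruptedException | IOException e) {
                // Stop polling; a real service would log the failure here.
            }
        }, "lexicon-autoloader");
        t.setDaemon(true); // a daemon thread never keeps the JVM alive on its own
        t.start();
        return t;
    }
}
```

Polling is deliberately simple and portable. Java's WatchService is an alternative when lower latency matters, at the cost of platform-specific behavior.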
Caveats
- POS tagging is explicitly noted as “not very ideal” and not recommended for applications requiring high accuracy
- The latest release mentioned in the README is 2.6.3; how actively the project is maintained is unclear from the README alone
- Some features like synonym completion from 《中华同义词词典》 are marked as unfinished
Verdict
Worth a look if you’re building Chinese-language search in Java and need something battle-tested with broad Lucene-ecosystem integration. Skip it if you need state-of-the-art deep-learning-based segmentation or if your stack isn’t JVM-based—the HTTP server helps, but the real value is in the analyzer plugins.