lionsoul2014/jcseg

A 920-star Java tokenizer that takes Chinese text apart with surgical precision

Jcseg is a long-running Java NLP library that segments CJK and English text, extracts keywords and summaries, and plugs directly into Lucene, Solr, Elasticsearch, and OpenSearch.

920 stars · Java · Language Models · Data Tooling

What it does

Jcseg tokenizes Chinese, Japanese, Korean, and English text using the MMSEG algorithm with seven different segmentation modes—from a fast “simple” mode to an NLP mode that recognizes emails, phone numbers, and place names. It also extracts keywords, key phrases, key sentences, and auto-summaries using TextRank (plus BM25 for summaries). A built-in Jetty server exposes everything via HTTP with JSON output.
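
To make the "simple" end of that spectrum concrete, here is a minimal sketch of greedy forward maximum matching (FMM), assuming the simple mode corresponds to the FMM mode listed in the highlights. The class `FmmSketch`, its `segment` method, and the four-word dictionary are hypothetical and for illustration only; this is not Jcseg's code or API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of Jcseg: greedy forward maximum matching. */
class FmmSketch {

    /** Splits text by always taking the longest dictionary word at the cursor. */
    static List<String> segment(String text, Set<String> dict) {
        // The longest dictionary entry bounds how far ahead we need to look.
        int maxLen = 1;
        for (String w : dict) {
            maxLen = Math.max(maxLen, w.length());
        }
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int len = Math.min(maxLen, text.length() - pos);
            // Shrink the window until it matches; one character is the fallback.
            while (len > 1 && !dict.contains(text.substring(pos, pos + len))) {
                len--;
            }
            tokens.add(text.substring(pos, pos + len));
            pos += len;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Toy dictionary (an assumption, not Jcseg's lexicon).
        Set<String> dict = Set.of("研究", "研究生", "生命", "起源");
        System.out.println(segment("研究生命起源", dict)); // [研究生, 命, 起源]
    }
}
```

The greedy pass grabs “研究生” first and strands “命”, where a reader would prefer “研究 / 生命 / 起源”. That is the kind of ambiguity MMSEG's filters target: it compares candidate multi-word chunks instead of committing to the first longest match.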

The interesting bit

The library doesn’t just split text; it chases down edge cases that break most tokenizers. Chinese fractions like “三十分之二” become “2/30”, mixed-language tokens such as “B超” and “卡拉ok” stay intact, and the same MMSEG logic can split run-together English strings such as “openarkcompiler” into “open ark compiler”. The README includes a gloriously dense test paragraph covering Tom Cruise’s divorce, Sichuan mala tang, and C++ book titles, all tokenized with part-of-speech tags.
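
The fraction rewrite is easy to picture with a small sketch. The code below is one possible way to turn an “X分之Y” expression into “Y/X”, assuming numerals below 10,000 written with 十, 百, and 千. The class `ChineseFractionSketch` and its methods are hypothetical; how Jcseg actually implements the conversion is not described here.

```java
import java.util.Map;

/** Hypothetical helper, not Jcseg code: converts "X分之Y" to "Y/X". */
class ChineseFractionSketch {

    private static final String DIGITS = "零一二三四五六七八九";
    private static final Map<Character, Integer> UNITS =
            Map.of('十', 10, '百', 100, '千', 1000);

    /** Parses a Chinese numeral below 10000, e.g. "三十" gives 30. */
    static int parseNumber(String s) {
        int total = 0;
        int current = 0;
        for (char c : s.toCharArray()) {
            int digit = DIGITS.indexOf(c);
            if (digit >= 0) {
                current = digit;
            } else if (UNITS.containsKey(c)) {
                // A bare unit such as "十" or "百" implies a leading one.
                total += (current == 0 ? 1 : current) * UNITS.get(c);
                current = 0;
            } else {
                throw new IllegalArgumentException("Unsupported character: " + c);
            }
        }
        return total + current;
    }

    /** The denominator comes before "分之" and the numerator after it. */
    static String toFraction(String s) {
        int idx = s.indexOf("分之");
        if (idx <= 0 || idx + 2 >= s.length()) {
            throw new IllegalArgumentException("Not a fraction: " + s);
        }
        int denominator = parseNumber(s.substring(0, idx));
        int numerator = parseNumber(s.substring(idx + 2));
        return numerator + "/" + denominator;
    }

    public static void main(String[] args) {
        System.out.println(toFraction("三十分之二")); // 2/30
    }
}
```

The sketch leaves the fraction unreduced, matching the “2/30” output quoted above.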

Key highlights

  • Seven segmentation modes including FMM, MMSEG with four ambiguity filters, n-gram, and a dedicated “search” mode for fine-grained retrieval
  • Plugs into Lucene, Solr, Elasticsearch, and OpenSearch as custom analyzers/tokenizers
  • Auto-recognizes Chinese names (~94% accuracy, ~98% with rules), emails, URLs, phone numbers, currencies, datetime, and custom entities via lexicon
  • Supports traditional/simplified Chinese conversion, pinyin appending, and synonym matching from 《现代汉语词典》 and CC-CEDICT
  • Hot-reloads dictionaries via a daemon thread watching for lex-autoload.todo changes
  • Standalone HTTP server with REST API for polyglot access
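
The hot-reload highlight can be sketched as a daemon thread that polls a trigger file and merges new entries into an in-memory word set. Everything below is an assumption made for illustration: the class `LexiconReloadSketch`, the one-word-per-line file format, and the polling interval are hypothetical, and Jcseg's actual handling of lex-autoload.todo may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch, not Jcseg code: reload words when a trigger file changes. */
class LexiconReloadSketch {

    private final Path triggerFile;
    private final Set<String> words = ConcurrentHashMap.newKeySet();
    private long lastSeen = Long.MIN_VALUE;

    LexiconReloadSketch(Path triggerFile) {
        this.triggerFile = triggerFile;
    }

    /** Checks the trigger file once; returns true if its content was (re)loaded. */
    synchronized boolean pollOnce() {
        try {
            if (!Files.exists(triggerFile)) {
                return false;
            }
            long modified = Files.getLastModifiedTime(triggerFile).toMillis();
            if (modified == lastSeen) {
                return false; // unchanged since the last load
            }
            lastSeen = modified;
            // Simplification: the file itself lists one word per line.
            for (String line : Files.readAllLines(triggerFile, StandardCharsets.UTF_8)) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean contains(String word) {
        return words.contains(word);
    }

    /** Starts a daemon thread that polls at a fixed interval. */
    Thread start(long intervalMillis) {
        Thread t = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    pollOnce();
                    Thread.sleep(intervalMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop polling quietly
            }
        }, "lexicon-reload");
        t.setDaemon(true); // never keeps the JVM alive on its own
        t.start();
        return t;
    }
}
```

Polling the modification time keeps the sketch short; `java.nio.file.WatchService` is an alternative when event-driven notification is preferred.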

Caveats

  • The README itself describes POS tagging as “not very ideal” and does not recommend it for applications that require high accuracy
  • The latest release mentioned in the README appears to be 2.6.3; the README alone does not show how actively the project is maintained
  • Some features like synonym completion from 《中华同义词词典》 are marked as unfinished

Verdict

Worth a look if you’re building Chinese-language search in Java and need something battle-tested with broad Lucene-ecosystem integration. Skip it if you need state-of-the-art deep-learning-based segmentation or if your stack isn’t JVM-based—the HTTP server helps, but the real value is in the analyzer plugins.
