Is jcseg open source?

Yes — lionsoul2014/jcseg is open source, released under the Apache-2.0 license.

What language is jcseg written in?

lionsoul2014/jcseg is primarily written in Java.

How popular is jcseg?

lionsoul2014/jcseg has 921 stars on GitHub.

Where can I find jcseg?

lionsoul2014/jcseg is on GitHub at https://github.com/lionsoul2014/jcseg.

lionsoul2014/jcseg

A 921-star Java tokenizer that takes Chinese text apart with surgical precision

Jcseg is a long-running Java NLP library that segments CJK and English text, extracts keywords and summaries, and plugs directly into Lucene, Solr, Elasticsearch, and OpenSearch.

★ 921 stars · Java · Language Models · Data Tooling

What it does

Jcseg tokenizes Chinese, Japanese, Korean, and English text using the MMSEG algorithm with seven different segmentation modes—from a fast “simple” mode to an NLP mode that recognizes emails, phone numbers, and place names. It also extracts keywords, key phrases, key sentences, and auto-summaries using TextRank (plus BM25 for summaries). A built-in Jetty server exposes everything via HTTP with JSON output.
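
As a rough illustration of the dictionary-driven matching behind the fast mode, here is a minimal forward-maximum-matching (FMM) sketch. It is not Jcseg's code: the class FmmSketch, the segment method, the maxWordLen parameter, and the four-word lexicon are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy forward-maximum-matching segmenter; not Jcseg's actual implementation. */
final class FmmSketch {

    /**
     * Scans left to right and, at each position, takes the longest
     * dictionary word that starts there (up to maxWordLen chars).
     * Text not covered by the dictionary falls out as single-char tokens.
     */
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the candidate until it is a dictionary word or one char long.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical mini-lexicon, for illustration only.
        Set<String> dict = Set.of("研究", "研究生", "生命", "起源");
        System.out.println(segment("研究生命起源", dict, 4)); // prints [研究生, 命, 起源]
    }
}
```

The greedy scan grabs “研究生” (graduate student) first and strands “命”, although the more natural reading is 研究 / 生命 / 起源. Ambiguities of this kind are what the additional filtering rules of a full MMSEG segmenter are meant to resolve.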

The interesting bit

The library doesn’t just split text; it chases down edge cases that trip up many tokenizers. Chinese fractions like “三十分之二” become “2/30”, “B超” and “卡拉ok” stay intact as mixed-language tokens, and it can even segment run-together English strings such as “openarkcompiler” into “open ark compiler” using the same MMSEG logic. The README includes a gloriously dense test paragraph covering Tom Cruise’s divorce, Sichuan mala tang, and C++ book titles, all tokenized with part-of-speech tags.
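
The fraction rewrite can be pictured as parsing the numerals on either side of “分之” and swapping them into numerator/denominator order. The sketch below is only an illustration of that idea, not Jcseg's implementation: the class ChineseFractionSketch and its methods are hypothetical, and it assumes plain numerals below 10,000.

```java
import java.util.Map;

/** Toy converter for Chinese fractions of the form "<denominator>分之<numerator>". */
final class ChineseFractionSketch {

    private static final Map<Character, Integer> DIGITS = Map.of(
            '零', 0, '一', 1, '二', 2, '三', 3, '四', 4,
            '五', 5, '六', 6, '七', 7, '八', 8, '九', 9);

    private static final Map<Character, Integer> UNITS = Map.of(
            '十', 10, '百', 100, '千', 1000);

    /** Parses a Chinese numeral below 10,000, e.g. "三十" gives 30. */
    static int parseNumber(String s) {
        int total = 0;
        int digit = 0; // last digit seen, waiting for a unit
        for (char c : s.toCharArray()) {
            if (DIGITS.containsKey(c)) {
                digit = DIGITS.get(c);
            } else if (UNITS.containsKey(c)) {
                // A bare unit, such as the "十" in "十二", counts as 1 x unit.
                total += (digit == 0 ? 1 : digit) * UNITS.get(c);
                digit = 0;
            } else {
                throw new IllegalArgumentException("Not a numeral: " + c);
            }
        }
        return total + digit;
    }

    /**
     * Rewrites "X分之Y" as "Y/X". Input without a "分之" separator is
     * returned unchanged; non-numeral operands raise IllegalArgumentException.
     */
    static String normalize(String token) {
        int idx = token.indexOf("分之");
        if (idx <= 0 || idx + 2 >= token.length()) {
            return token;
        }
        int denominator = parseNumber(token.substring(0, idx));
        int numerator = parseNumber(token.substring(idx + 2));
        return numerator + "/" + denominator;
    }

    public static void main(String[] args) {
        System.out.println(normalize("三十分之二")); // prints 2/30
    }
}
```

Note that the fraction is not reduced: “三十分之二” maps to “2/30” rather than “1/15”, matching the example above.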

Key highlights

Seven segmentation modes including FMM, MMSEG with four ambiguity filters, n-gram, and a dedicated “search” mode for fine-grained retrieval
Plugs into Lucene, Solr, Elasticsearch, and OpenSearch as custom analyzers/tokenizers
Auto-recognizes Chinese names (~94% accuracy, ~98% with rules), emails, URLs, phone numbers, currencies, datetime, and custom entities via lexicon
Supports traditional/simplified Chinese conversion, pinyin appending, and synonym matching from 《现代汉语词典》 and CC-CEDICT
Hot-reloads dictionaries via a daemon thread watching for lex-autoload.todo changes
Standalone HTTP server with REST API for polyglot access
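
The hot-reload highlight above can be sketched with a daemon thread that polls a marker file and fires a callback when its modification time changes. This is a generic illustration under stated assumptions, not Jcseg's watcher: the class ReloadWatcherSketch, the polling interval, and the callback are hypothetical, and the sketch does not read or interpret the contents of lex-autoload.todo.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

/** Toy file watcher: polls one marker file and fires a callback when it changes. */
final class ReloadWatcherSketch {

    private final Path marker;
    private final Runnable onChange;
    private FileTime lastSeen;

    ReloadWatcherSketch(Path marker, Runnable onChange) throws IOException {
        this.marker = marker;
        this.onChange = onChange;
        this.lastSeen = Files.getLastModifiedTime(marker);
    }

    /** Runs one poll; returns true if the marker changed and the callback ran. */
    synchronized boolean checkOnce() throws IOException {
        FileTime now = Files.getLastModifiedTime(marker);
        if (now.equals(lastSeen)) {
            return false;
        }
        lastSeen = now;
        onChange.run();
        return true;
    }

    /** Starts a daemon thread that polls every intervalMillis until interrupted. */
    Thread start(long intervalMillis) {
        Thread t = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    Thread.sleep(intervalMillis);
                    checkOnce();
                }
            } catch (InterruptedException | IOException e) {
                // Stop polling on interrupt or when the marker becomes unreadable.
            }
        }, "lexicon-reload-watcher");
        t.setDaemon(true); // do not keep the JVM alive just for polling
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        // A temp file stands in for the real marker file.
        Path marker = Files.createTempFile("lex-autoload", ".todo");
        ReloadWatcherSketch watcher = new ReloadWatcherSketch(marker,
                () -> System.out.println("marker changed: reload lexicon here"));
        watcher.start(500);
        Files.setLastModifiedTime(marker,
                FileTime.fromMillis(System.currentTimeMillis() + 60_000));
        Thread.sleep(1500); // give the daemon thread time to notice
        Files.delete(marker);
    }
}
```

Marking the thread as a daemon means the watcher never blocks JVM shutdown, which suits a background task whose only job is to refresh in-memory dictionaries.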

Caveats

POS tagging is explicitly noted as “not very ideal” and not recommended for applications requiring high accuracy
The latest release appears to be version 2.6.3; how actively the project is maintained is unclear from the README
Some features like synonym completion from 《中华同义词词典》 are marked as unfinished

Verdict

Worth a look if you’re building Chinese-language search in Java and need something battle-tested with broad Lucene-ecosystem integration. Skip it if you need state-of-the-art deep-learning-based segmentation or if your stack isn’t JVM-based—the HTTP server helps, but the real value is in the analyzer plugins.

Frequently asked

What is lionsoul2014/jcseg?: Jcseg is a long-running Java NLP library that segments CJK and English text, extracts keywords and summaries, and plugs directly into Lucene, Solr, Elasticsearch, and OpenSearch.