PHP's answer to Chinese text segmentation
A PHP port of the popular jieba library that slices Chinese text into meaningful words without calling an LLM API.

What it does
jieba-php segments Chinese text into words — a surprisingly hard problem since Chinese doesn’t use spaces. It offers three modes: precise (default), full (over-generates all possible words), and search-engine (splits long words for better recall). It also handles keyword extraction via TF-IDF, part-of-speech tagging, and custom dictionaries.
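To make the difference between the modes concrete, here is a minimal sketch of what "full" mode means conceptually: emit every dictionary word that starts at every position, overlaps included. It is an illustration, not the library's code; the dictionary is a tiny made-up one and the function name fullModeCut is hypothetical.

```php
<?php
// Toy dictionary (hypothetical); the real library ships a large frequency dictionary.
$dict = array_fill_keys(['我', '来到', '北京', '清华', '华大', '大学', '清华大学'], true);

// Hypothetical helper: list every dictionary word found at every start position.
function fullModeCut(string $text, array $dict): array
{
    // Split into Unicode characters without depending on the mbstring extension.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $words = [];
    for ($i = 0; $i < $n; $i++) {
        $candidate = '';
        for ($j = $i; $j < $n; $j++) {
            $candidate .= $chars[$j];
            if (isset($dict[$candidate])) {
                $words[] = $candidate; // overlapping matches are kept on purpose
            }
        }
    }
    return $words;
}

$fullWords = fullModeCut('我来到北京清华大学', $dict);
echo implode(' / ', $fullWords), PHP_EOL;
// prints: 我 / 来到 / 北京 / 清华 / 清华大学 / 华大 / 大学
```

Precise mode would instead pick one non-overlapping path through these candidates (here 我 / 来到 / 北京 / 清华大学), which is why full mode favors recall and precise mode favors clean output.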
The interesting bit
The README is admirably honest: it admits LLMs now do better segmentation, but this runs locally, cheaply, and fast. Under the hood it scans the text against a trie-based dictionary to build a directed acyclic graph (DAG) of possible word paths, then uses dynamic programming to find the highest-probability split. For unknown words, it falls back to a hidden Markov model (HMM) with Viterbi decoding: classic NLP machinery that doesn’t need a GPU.
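The graph-plus-dynamic-programming idea can be sketched in a few lines. The version below is a simplified illustration under stated assumptions, not the library's implementation: the frequencies are invented, a plain array stands in for the trie, unknown characters become single-character words with a count of 1, and the names buildDag and bestSplit are hypothetical. The HMM fallback is left out.

```php
<?php
// Hypothetical toy word counts; not taken from the library's dictionary.
$freq = ['我' => 50, '来到' => 20, '北京' => 30, '清华' => 10,
         '华大' => 2, '大学' => 25, '清华大学' => 15];

// For each start index i, collect the end indices j where chars[i..j] is a known word.
function buildDag(array $chars, array $freq): array
{
    $n = count($chars);
    $dag = [];
    for ($i = 0; $i < $n; $i++) {
        $ends = [];
        $candidate = '';
        for ($j = $i; $j < $n; $j++) {
            $candidate .= $chars[$j];
            if (isset($freq[$candidate])) {
                $ends[] = $j;
            }
        }
        // No dictionary word starts here: fall back to the single character.
        $dag[$i] = $ends ?: [$i];
    }
    return $dag;
}

// Walk the DAG right to left, keeping the best log-probability path from each position.
function bestSplit(string $text, array $freq): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $dag = buildDag($chars, $freq);
    $logTotal = log(array_sum($freq));
    $route = [$n => [0.0, $n]]; // route[i] = [best score from i, end index of first word]
    for ($i = $n - 1; $i >= 0; $i--) {
        $best = null;
        foreach ($dag[$i] as $j) {
            $word = implode('', array_slice($chars, $i, $j - $i + 1));
            $score = log($freq[$word] ?? 1) - $logTotal + $route[$j + 1][0];
            if ($best === null || $score > $best[0]) {
                $best = [$score, $j];
            }
        }
        $route[$i] = $best;
    }
    // Follow the recorded choices left to right to read off the words.
    $words = [];
    $i = 0;
    while ($i < $n) {
        $j = $route[$i][1];
        $words[] = implode('', array_slice($chars, $i, $j - $i + 1));
        $i = $j + 1;
    }
    return $words;
}

echo implode(' / ', bestSplit('我来到北京清华大学', $freq)), PHP_EOL;
// prints: 我 / 来到 / 北京 / 清华大学
```

With these counts, 清华大学 as one word (15/152) outscores 清华 followed by 大学 (10/152 × 25/152), so the longer word wins. Summing log-probabilities instead of multiplying raw probabilities avoids floating-point underflow on long sentences.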
Key highlights
- Three segmentation modes for different trade-offs between precision and recall
- Supports traditional Chinese (switch dictionary to “big” mode)
- CJK support: Chinese, Japanese, Korean text processing
- Custom dictionaries with word frequency and part-of-speech tags
- TF-IDF keyword extraction with stop-word filtering
- Memory management and caching optimizations (critical given the ini_set('memory_limit', '1024M') call in the examples)
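For readers unfamiliar with TF-IDF keyword extraction, the core scoring step is small enough to sketch. This is a generic illustration, not the library's API: the function name extractKeywords, the IDF values, the stop-word list, and the default IDF for unseen words are all assumptions made for the example, and the input is assumed to be already segmented.

```php
<?php
// Hypothetical helper: score each non-stop-word by term frequency times IDF.
function extractKeywords(array $tokens, array $idf, array $stopWords, int $topK, float $defaultIdf = 1.0): array
{
    $stop = array_flip($stopWords);
    // Drop stop words before counting so they do not dilute term frequencies.
    $kept = array_values(array_filter($tokens, function ($t) use ($stop) {
        return !isset($stop[$t]);
    }));
    $total = count($kept);
    if ($total === 0) {
        return [];
    }
    $scores = [];
    foreach (array_count_values($kept) as $word => $count) {
        // Words missing from the IDF table get the assumed default weight.
        $scores[$word] = ($count / $total) * ($idf[$word] ?? $defaultIdf);
    }
    arsort($scores); // highest score first, keys preserved
    return array_slice($scores, 0, $topK, true);
}

// Toy, pre-segmented input and made-up IDF weights.
$tokens = ['我', '喜欢', '北京', '北京', '的', '大学', '大学', '大学'];
$idf = ['喜欢' => 2.0, '北京' => 3.0, '大学' => 1.5];
$keywords = extractKeywords($tokens, $idf, ['我', '的'], 2);
echo implode(', ', array_keys($keywords)), PHP_EOL;
// prints: 北京, 大学
```

Here 大学 appears most often, but 北京 ranks first because its assumed IDF is higher: a rarer word across the corpus counts for more than a merely frequent one.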
Caveats
- Requires substantial memory: examples show 600M–1024M limits, suggesting the dictionary is loaded entirely into RAM
- Manual installation path is tedious (multiple require_once statements); Composer is strongly preferred
- README notes it originated as a translation of the Python jieba, though it now maintains its own branch
Verdict
Worth a look if you’re building search, indexing, or analytics in PHP and need Chinese segmentation without API dependencies. Skip it if you’re already running Python infrastructure or need state-of-the-art accuracy — the authors themselves suggest LLMs for that.