fukuball/jieba-php

PHP's answer to Chinese text segmentation

A PHP port of the popular jieba library that slices Chinese text into meaningful words without calling an LLM API.

1.4k stars · PHP · Data Tooling · Other AI
Velocity (7d): +0.3 ★/day · Trend: steady

What it does

jieba-php segments Chinese text into words, a hard problem because written Chinese puts no spaces between words. It offers three modes: precise (the default, which picks a single best segmentation), full (which emits every dictionary word it can find, overlaps included), and search-engine (which additionally splits long words into shorter ones for better recall). It also handles keyword extraction via TF-IDF, part-of-speech tagging, and custom dictionaries.
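
To make the difference between the modes concrete, here is a minimal Python sketch over a toy dictionary. It is not jieba-php's code or API: `TOY_DICT`, `full_mode`, and `search_mode` are hypothetical names, and the precise-mode tokens are supplied by hand rather than computed.

```python
# Toy dictionary (an assumption for illustration; the real one is far larger).
TOY_DICT = {"我", "来到", "北京", "清华", "华大", "大学", "清华大学"}


def full_mode(text, dictionary, max_len=4):
    """Emit every dictionary word found at any position, overlaps included."""
    words = []
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            if text[start:end] in dictionary:
                words.append(text[start:end])
    return words


def search_mode(precise_tokens, dictionary):
    """For each long token, also emit the shorter dictionary words inside it."""
    out = []
    for token in precise_tokens:
        for size in (2, 3):
            if len(token) > size:  # only split tokens longer than the piece
                for i in range(len(token) - size + 1):
                    piece = token[i:i + size]
                    if piece in dictionary:
                        out.append(piece)
        out.append(token)
    return out


sentence = "我来到北京清华大学"
precise = ["我", "来到", "北京", "清华大学"]  # hand-written precise-mode result

print(full_mode(sentence, TOY_DICT))
# ['我', '来到', '北京', '清华', '清华大学', '华大', '大学']
print(search_mode(precise, TOY_DICT))
# ['我', '来到', '北京', '清华', '华大', '大学', '清华大学']
```

Full mode trades precision for coverage, while the search-style pass keeps the precise tokens and adds sub-words, which is what helps recall in an index.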

The interesting bit

The README is admirably honest: it admits that LLMs now segment better, but points out that this library runs locally, cheaply, and fast. Under the hood it uses a trie of dictionary words to build a directed acyclic graph (DAG) of every possible word path through a sentence, then dynamic programming to pick the highest-probability path. For words missing from the dictionary, it falls back to a hidden Markov model (HMM) with Viterbi decoding: classic NLP machinery that doesn’t need a GPU.
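
The graph-plus-dynamic-programming step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's implementation: the word counts in `TOY_FREQ` are made up, a plain dict lookup stands in for the trie, and the HMM fallback for unknown words is left out (unknown characters simply become single-character tokens).

```python
import math

# Hypothetical word counts; a real dictionary stores corpus frequencies.
TOY_FREQ = {"我": 50, "来到": 20, "北京": 30, "清华": 10,
            "华大": 2, "大学": 25, "清华大学": 8}


def build_dag(text, freq):
    """Map each start index to the end indices of dictionary words there."""
    dag = {}
    for start in range(len(text)):
        ends = [end for end in range(start, len(text))
                if text[start:end + 1] in freq]
        dag[start] = ends or [start]  # fall back to a single character
    return dag


def best_route(text, dag, freq):
    """Right-to-left DP: best (log-probability, end index) from each start."""
    log_total = math.log(sum(freq.values()))
    route = {len(text): (0.0, 0)}
    for start in range(len(text) - 1, -1, -1):
        route[start] = max(
            (math.log(freq.get(text[start:end + 1], 1)) - log_total
             + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def cut(text, freq):
    """Walk the best route from the start of the text and collect the words."""
    route = best_route(text, build_dag(text, freq), freq)
    words, start = [], 0
    while start < len(text):
        end = route[start][1]
        words.append(text[start:end + 1])
        start = end + 1
    return words


print(cut("我来到北京清华大学", TOY_FREQ))
# ['我', '来到', '北京', '清华大学']
```

With these counts the single word 清华大学 beats the split 清华 + 大学, because multiplying two word probabilities costs more than one moderately rare long word.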

Key highlights

  • Three segmentation modes for different trade-offs between precision and recall
  • Supports traditional Chinese (switch the dictionary option to “big”)
  • CJK support: Chinese, Japanese, Korean text processing
  • Custom dictionaries with word frequency and part-of-speech tags
  • TF-IDF keyword extraction with stop-word filtering
  • Memory management and caching optimizations (critical given the ini_set('memory_limit', '1024M') in examples)
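
As a rough picture of the TF-IDF keyword extraction listed above, here is a small Python sketch. It is not the library's API: `extract_tags`, `TOY_IDF`, and `STOP_WORDS` are hypothetical, the IDF values are invented, and the choice of the median IDF as the fallback for unseen words is an assumption of this sketch.

```python
from collections import Counter

# Invented IDF weights and stop words, for illustration only.
TOY_IDF = {"北京": 3.0, "大学": 4.0, "清华大学": 9.0}
STOP_WORDS = {"的", "我", "在"}


def extract_tags(tokens, idf, stop_words, top_k=2):
    """Rank already-segmented tokens by term frequency times IDF."""
    kept = [t for t in tokens if t not in stop_words]
    if not kept:
        return []
    counts = Counter(kept)
    fallback = sorted(idf.values())[len(idf) // 2]  # median IDF for unseen words
    scores = {word: (count / len(kept)) * idf.get(word, fallback)
              for word, count in counts.items()}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


tokens = ["我", "在", "北京", "的", "清华大学", "大学", "北京"]
print(extract_tags(tokens, TOY_IDF, STOP_WORDS))
# ['清华大学', '北京']
```

The rare word wins even though 北京 appears twice, which is the point of weighting by IDF; stop words never reach the scoring step.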

Caveats

  • Requires substantial memory: the examples set memory_limit to between 600M and 1024M, suggesting the dictionary is loaded entirely into RAM
  • Manual installation path is tedious (multiple require_once statements); Composer is strongly preferred
  • README notes it originated as a translation of the Python jieba, though it now maintains its own branch

Verdict

Worth a look if you’re building search, indexing, or analytics in PHP and need Chinese segmentation without API dependencies. Skip it if you’re already running Python infrastructure or need state-of-the-art accuracy — the authors themselves suggest LLMs for that.
