NLP for the PHP holdouts: tokenize without leaving your stack
A PHP-native library that brings text analysis, sentiment scoring, and document classification to codebases that can't justify a Python microservice.

What it does
php-text-analysis is a PHP library for common NLP and information-retrieval tasks: tokenization, stemming, frequency analysis, n-grams, sentiment scoring with VADER, keyword extraction with RAKE, and naive Bayes document classification. It exposes most operations through plain helper functions (tokenize(), stem(), vader(), naive_bayes()) so you can get from raw text to results in a few lines.
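To make "a few lines" concrete, here is a minimal sketch that uses only the helpers named above. Several details are assumptions rather than facts from this write-up: the Composer package name (yooper/php-text-analysis), the autoload path, and the argument shapes, which follow the project README as best recalled (token arrays passed to stem() and vader(), a label plus tokens passed to train()). Check them against the version you install.

```php
<?php
// Assumption: installed via `composer require yooper/php-text-analysis`,
// which is expected to register the global helper functions on autoload.
require __DIR__ . '/vendor/autoload.php';

$text = 'The food was great, but the service was painfully slow.';

// Raw text to tokens, using whichever tokenizer the helper defaults to.
$tokens = tokenize($text);

// Reduce each token to its stem.
$stems = stem($tokens);

// VADER sentiment scores for the token list (assumed to return an array).
$scores = vader($tokens);

// Train a toy naive Bayes classifier on two labels, then classify new text.
// The labels and training strings are made up for illustration.
$classifier = naive_bayes();
$classifier->train('positive', tokenize('great tasty friendly wonderful'));
$classifier->train('negative', tokenize('slow cold rude terrible'));
$prediction = $classifier->predict(tokenize('the waiter was rude and slow'));

print_r($stems);
print_r($scores);
print_r($prediction);
```

Two short training strings are nowhere near enough for a useful classifier; the point is only the shape of the calls.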
The interesting bit
The library wraps well-known algorithms—Porter stemmer, Penn TreeBank tokenizer, VADER sentiment—in a single Composer package with a consistent PHP API. That’s the value: not algorithmic novelty, but keeping text-processing logic inside a PHP monolith instead of ferrying data to a Python service.
Key highlights
- Tokenizers are swappable; default is GeneralTokenizer, but PennTreeBankTokenizer and others can be passed by class name
- normalize_tokens() accepts custom callbacks or string function names (e.g., mb_strtolower)
- Built-in RAKE keyword extraction and VADER sentiment analysis, both invoked through one-line helpers
- Naive Bayes classifier with a simple train()/predict() interface; movie-review example in the unit tests
- N-gram generation defaults to bigrams but supports custom lengths and delimiters
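As a rough illustration of what the normalization and n-gram options in this list amount to, here is a standalone sketch in plain PHP. It neither uses nor reproduces the library's code: sketch_normalize() and sketch_ngrams() are hypothetical names, chosen so they cannot clash with the library's own global helpers.

```php
<?php
// Standalone illustration only. These hypothetical functions are NOT the
// library's implementation, just the same ideas written in plain PHP.

/**
 * Apply a normalizer to every token. The normalizer can be a closure or a
 * function name passed as a string (for example 'strtolower').
 */
function sketch_normalize(array $tokens, callable $normalizer): array
{
    return array_map($normalizer, $tokens);
}

/**
 * Build n-grams by sliding a window of $size tokens across the list and
 * joining each window with $separator. Defaults to bigrams joined by a space.
 */
function sketch_ngrams(array $tokens, int $size = 2, string $separator = ' '): array
{
    if ($size < 1) {
        throw new InvalidArgumentException('n-gram size must be at least 1');
    }

    $tokens = array_values($tokens);      // make sure indexes are 0..n-1
    $lastStart = count($tokens) - $size;  // last index a full window can start at
    $ngrams = [];

    for ($i = 0; $i <= $lastStart; $i++) {
        $ngrams[] = implode($separator, array_slice($tokens, $i, $size));
    }

    return $ngrams;
}

$tokens = ['The', 'Quick', 'Brown', 'Fox'];

// Normalizer given as a function name; 'mb_strtolower' would be passed the
// same way when the mbstring extension is loaded.
$lower = sketch_normalize($tokens, 'strtolower');

// Normalizer given as a closure.
$upper = sketch_normalize($tokens, fn (string $token): string => strtoupper($token));

echo implode(', ', sketch_ngrams($lower)), PHP_EOL;         // the quick, quick brown, brown fox
echo implode(', ', sketch_ngrams($lower, 3, '|')), PHP_EOL; // the|quick|brown, quick|brown|fox
```

A token list shorter than the requested size simply yields an empty array here; how the library itself handles that case is not covered by this sketch.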
Caveats
- Documentation is split across an unfinished book repo and a wiki; the README itself is mostly a function reference
- Some tokenizers “require parameters to be set upon instantiation”—the README notes this but doesn’t explain which ones or how
- No benchmarks, accuracy metrics, or corpus size guidance is provided
Verdict
Worth a look if you’re maintaining a PHP application that needs light NLP and can’t absorb the operational cost of a second runtime. If you’re starting fresh or doing heavy text processing, Python’s ecosystem is still the pragmatic choice.