yooper/php-text-analysis

NLP for the PHP holdouts: tokenize without leaving your stack

A PHP-native library that brings text analysis, sentiment scoring, and document classification to codebases that can't justify a Python microservice.

Velocity (7d): +0.1 ★/day · Trend: steady

What it does

php-text-analysis is a PHP library for common NLP and information-retrieval tasks: tokenization, stemming, frequency analysis, n-grams, sentiment scoring with VADER, keyword extraction with RAKE, and naive Bayes document classification. It exposes most operations through plain helper functions (tokenize(), stem(), vader(), naive_bayes()) so you can get from raw text to results in a few lines.
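As a rough illustration of that "few lines" claim, the sketch below chains the helpers named in this review. It assumes the package has been installed with `composer require yooper/php-text-analysis`, that Composer's autoloader registers the global helpers, and that the argument order shown (tokens in, results out) matches the project README; treat the signatures as assumptions and check them against the version you install. The sample sentence is made up.

```php
<?php
// Assumption: the package is installed in ./vendor and its global helper
// functions are loaded by Composer's autoloader.
require __DIR__ . '/vendor/autoload.php';

// Made-up sample text, for illustration only.
$text = 'The service was slow, but the food was absolutely wonderful.';

// Split raw text into tokens with the default tokenizer.
$tokens = tokenize($text);

// Lowercase every token; the callback is passed as a string function name.
$normalized = normalize_tokens($tokens, 'mb_strtolower');

// Reduce the normalized tokens to stems with the default stemmer.
$stems = stem($normalized);

// Score sentiment on the original tokens, since VADER takes
// capitalization and punctuation into account.
$sentiment = vader($tokens);

print_r($stems);
print_r($sentiment);
```

No expected output is shown because the exact tokens and scores depend on the installed version and its bundled lexicon.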

The interesting bit

The library wraps well-known algorithms (Porter stemmer, Penn TreeBank tokenizer, VADER sentiment) in a single Composer package with a consistent PHP API. The value is not algorithmic novelty but keeping text-processing logic inside a PHP monolith instead of ferrying data to a Python service.

Key highlights

  • Tokenizers are swappable; default is GeneralTokenizer, but PennTreeBankTokenizer and others can be passed by class name
  • normalize_tokens() accepts custom callbacks or string function names (e.g., mb_strtolower)
  • Built-in RAKE keyword extraction and VADER sentiment analysis, both invoked through one-line helpers
  • Naive Bayes classifier with a simple train()/predict() interface; movie-review example in the unit tests
  • N-gram generation defaults to bigrams but supports custom lengths and delimiters
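To make the n-gram highlight concrete, here is a plain-PHP sketch of what n-gram generation with a configurable length and delimiter amounts to. It does not call the library: `sketch_ngrams` is a hypothetical name invented for this example, and the library's own helper may differ in signature, defaults, and edge-case behavior.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in, not the library's implementation.
 * Slides a window of $n tokens over the list and joins each window
 * with $delimiter. Defaults to bigrams joined by a single space.
 */
function sketch_ngrams(array $tokens, int $n = 2, string $delimiter = ' '): array
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1');
    }

    // Re-index so the window offsets below are contiguous.
    $tokens = array_values($tokens);
    $ngrams = [];
    $lastStart = count($tokens) - $n;

    // If there are fewer than $n tokens, the loop body never runs.
    for ($i = 0; $i <= $lastStart; $i++) {
        $ngrams[] = implode($delimiter, array_slice($tokens, $i, $n));
    }

    return $ngrams;
}

$tokens = ['text', 'analysis', 'in', 'plain', 'php'];

// Bigrams: 'text analysis', 'analysis in', 'in plain', 'plain php'
print_r(sketch_ngrams($tokens));

// Trigrams with a custom delimiter: 'text|analysis|in', 'analysis|in|plain', 'in|plain|php'
print_r(sketch_ngrams($tokens, 3, '|'));
```

A delimiter other than a space is useful when the tokens themselves may contain spaces, since it keeps the n-gram boundaries unambiguous.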

Caveats

  • Documentation is split across an unfinished book repo and a wiki; the README itself is mostly a function reference
  • Some tokenizers “require parameters to be set upon instantiation”; the README notes this but doesn’t say which ones or how
  • The README provides no benchmarks, accuracy metrics, or corpus-size guidance

Verdict

Worth a look if you’re maintaining a PHP application that needs light NLP and can’t absorb the operational cost of a second runtime. If you’re starting fresh or doing heavy text processing, Python’s ecosystem is still the pragmatic choice.
