One-stop shop for judging your chatbot's word salad
A Python toolkit that bundles nine NLG evaluation metrics so you don't have to wire together BLEU, METEOR, and friends by hand.

What it does
nlg-eval is a Python wrapper that runs nine automated metrics on generated text. Feed it a hypothesis file and one or more reference files; it returns BLEU, METEOR, ROUGE, CIDEr, SPICE, and four embedding-based similarity scores. There’s a command-line tool, a functional API for one-off calls, and an object-oriented API if you’re batch-processing inside a script.
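As a rough sketch of the functional API, the snippet below wraps the two calls shown in the project's README: `compute_metrics` for files and `compute_individual_metrics` for a single sentence. The helper names `write_lines`, `score_files`, and `score_sentence` are hypothetical. The nlgeval imports sit inside the functions so the file-preparation step still runs on a machine without the toolkit or its models.

```python
from pathlib import Path


def write_lines(path, sentences):
    """Write one sentence per line; line i of every reference file
    pairs with line i of the hypothesis file."""
    Path(path).write_text("\n".join(sentences) + "\n", encoding="utf-8")


def score_files(hyp_path, ref_paths):
    """Score a hypothesis file against one or more reference files.

    Returns a dict mapping metric name to score.
    """
    # Imported lazily: needs nlg-eval installed and its data downloaded.
    from nlgeval import compute_metrics
    return compute_metrics(hypothesis=str(hyp_path),
                           references=[str(p) for p in ref_paths])


def score_sentence(hypothesis, references):
    """Score one generated sentence against a list of reference strings."""
    from nlgeval import compute_individual_metrics
    return compute_individual_metrics(references, hypothesis)
```

With the toolkit set up, something like `score_files("hyp.txt", ["ref1.txt", "ref2.txt"])` should return the full metric dict. The file names here are placeholders.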
The interesting bit
The embedding metrics (SkipThought, GloVe average/extrema, greedy matching) are the less common ones. They try to capture semantic similarity beyond n-gram overlap, which is useful when your dialogue system paraphrases rather than regurgitates. The README is admirably blunt about CIDEr’s IDF gotcha and METEOR’s memory tuning, signs of a tool that has actually been used in anger.
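To make the paraphrase point concrete, here is a toy sketch of the three GloVe-style scores next to plain unigram overlap. It is not nlg-eval's implementation: the two-dimensional vectors are invented for illustration, and the functions follow the textbook definitions (mean vector, per-dimension extreme value, best-match averaging) in simplified form.

```python
import math

# Made-up 2-d vectors for illustration only; nlg-eval itself uses GloVe.
TOY_VECS = {
    "hi": [0.9, 0.1],
    "hello": [0.85, 0.2],
    "thanks": [0.1, 0.9],
    "cheers": [0.2, 0.85],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_average(hyp, ref):
    """Cosine between the mean word vectors of the two sentences."""
    def mean(tokens):
        vecs = [TOY_VECS[t] for t in tokens]
        return [sum(dim) / len(vecs) for dim in zip(*vecs)]
    return cosine(mean(hyp), mean(ref))


def vector_extrema(hyp, ref):
    """Cosine between vectors built from each dimension's most extreme value."""
    def extrema(tokens):
        vecs = [TOY_VECS[t] for t in tokens]
        return [max(dim, key=abs) for dim in zip(*vecs)]
    return cosine(extrema(hyp), extrema(ref))


def greedy_matching(hyp, ref):
    """Match each word to its closest word on the other side, both directions."""
    def one_way(src, tgt):
        best = [max(cosine(TOY_VECS[s], TOY_VECS[t]) for t in tgt) for s in src]
        return sum(best) / len(best)
    return (one_way(hyp, ref) + one_way(ref, hyp)) / 2


def unigram_overlap(hyp, ref):
    """Fraction of distinct hypothesis words that also appear in the reference."""
    return len(set(hyp) & set(ref)) / len(set(hyp))


hyp = ["hello", "cheers"]   # paraphrase of the reference, no shared words
ref = ["hi", "thanks"]
print("overlap:", unigram_overlap(hyp, ref))
print("average:", embedding_average(hyp, ref))
print("extrema:", vector_extrema(hyp, ref))
print("greedy: ", greedy_matching(hyp, ref))
```

The paraphrase shares no words with the reference, so overlap is zero, while all three embedding scores land close to 1 on these toy vectors.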
Key highlights
- Nine metrics in one call, including four embedding-based similarity scores
- OO API caches loaded models for repeated evaluation in long-running scripts
- NLGEVAL_DATA environment variable supports shared or Docker-mounted data directories
- Setup script downloads ~6 GB of models and embeddings automatically
- Originated from a 2017 Maluuba paper on task-oriented dialogue evaluation
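The model-caching point can be sketched as below: build the `NLGEval` object once and reuse it for every hypothesis. The method call follows the README's `compute_individual_metrics(references, hypothesis)` usage. The wrapper `evaluate_pairs` and its `evaluator` parameter are hypothetical, added so the loop can be exercised with a stand-in object when the real models are not available.

```python
def evaluate_pairs(pairs, evaluator=None):
    """Score many (hypothesis, references) pairs with one loaded evaluator.

    pairs: iterable of (hypothesis_string, list_of_reference_strings).
    evaluator: any object exposing compute_individual_metrics(refs, hyp);
               defaults to a freshly constructed nlgeval.NLGEval.
    """
    if evaluator is None:
        # Constructing NLGEval loads the models, so do it once, not per pair.
        from nlgeval import NLGEval
        evaluator = NLGEval()
    return [evaluator.compute_individual_metrics(refs, hyp)
            for hyp, refs in pairs]
```

In a long-running script, you would create the evaluator at startup and pass the same instance into every call.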
Caveats
- Requires Java 1.8+; on modern macOS, multithreading also needs an environment tweak
- Windows setup can be fiddly; the nlg-eval script may not land on PATH
- CIDEr’s default “corpus” IDF mode returns zero for single-example datasets; the README points you to external patches rather than fixing it
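The CIDEr zero comes from simple arithmetic, which the stripped-down sketch below reproduces. It is an illustration, not the real scorer: it uses unigrams only, takes one reference per hypothesis, and skips CIDEr's length penalty and scaling. The function names are hypothetical. The part that matters is the IDF weight, log(N) minus log(document frequency), which is zero for every term when the corpus has a single document.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Term frequency times a corpus IDF of log(N) - log(df)."""
    log_n = math.log(num_docs)
    return {tok: count * (log_n - math.log(max(1.0, doc_freq[tok])))
            for tok, count in Counter(tokens).items()}


def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def corpus_idf_score(hyps, refs):
    """Average TF-IDF cosine over parallel lists of tokenized sentences."""
    # Document frequency is counted over the reference side only.
    doc_freq = Counter(tok for ref in refs for tok in set(ref))
    n = len(refs)
    scores = [cosine(tfidf_vector(h, doc_freq, n), tfidf_vector(r, doc_freq, n))
              for h, r in zip(hyps, refs)]
    return sum(scores) / len(scores)


# One example: log(1) = 0, so every weight vanishes, even for a perfect match.
single = corpus_idf_score([["the", "cat", "sat"]], [["the", "cat", "sat"]])

# Two examples with distinct vocabularies: weights are positive again.
double = corpus_idf_score(
    [["the", "cat", "sat"], ["a", "dog", "ran"]],
    [["the", "cat", "sat"], ["a", "dog", "ran"]],
)
print(single, double)
```

So a zero here says nothing about output quality. It only means the corpus was too small for corpus-level IDF to be meaningful.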
Verdict
Grab this if you’re evaluating dialogue systems or captioning models and want semantic metrics without plumbing them yourself. Skip it if you only need BLEU — sacrebleu is lighter and more actively maintained for that single purpose.