Is nlg-eval open source?

Yes — Maluuba/nlg-eval is an open-source project tracked on heatdrop.

What language is nlg-eval written in?

Maluuba/nlg-eval is primarily written in Python.

How popular is nlg-eval?

Maluuba/nlg-eval has 1.4k stars on GitHub.

Where can I find nlg-eval?

Maluuba/nlg-eval is on GitHub at https://github.com/Maluuba/nlg-eval.


Maluuba/nlg-eval

One-stop shop for judging your chatbot's word salad

A Python toolkit that bundles nine NLG evaluation metrics so you don't have to wire together BLEU, METEOR, and friends by hand.

★ 1.4k stars · Python · LLMOps · Eval · Language Models




What it does

nlg-eval is a Python wrapper that runs nine automated metrics on generated text. Feed it a hypothesis file with one generated sentence per line, plus one or more reference files whose lines align with it; it returns BLEU, METEOR, ROUGE-L, CIDEr, SPICE, and four embedding-based similarity scores. There’s a command-line tool, a functional API for one-off calls, and an object-oriented API for batch processing inside a script.
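
As a rough sketch of the input format described above, the snippet below writes a hypothesis file and line-aligned reference files, then hands them to the functional API. The helper names write_parallel_files and evaluate_files, the file names, and the layout of the Python lists are illustrative assumptions. compute_metrics is the function the project's README documents, but it needs the package and its downloaded data, so it is imported lazily here.

```python
import os


def write_parallel_files(hypotheses, references, out_dir):
    """Write one hypothesis file and N line-aligned reference files.

    references[i] holds the N reference sentences for hypotheses[i].
    Returns (hypothesis_path, [reference_paths]).
    (Hypothetical helper, not part of nlg-eval.)
    """
    if len(hypotheses) != len(references):
        raise ValueError("need one reference group per hypothesis")
    n_refs = len(references[0])
    if any(len(group) != n_refs for group in references):
        raise ValueError("every hypothesis needs the same number of references")

    hyp_path = os.path.join(out_dir, "hyp.txt")
    with open(hyp_path, "w", encoding="utf-8") as f:
        f.write("\n".join(hypotheses) + "\n")

    ref_paths = []
    for k in range(n_refs):
        # Line i of ref<k+1>.txt is the k-th reference for hypothesis i.
        path = os.path.join(out_dir, f"ref{k + 1}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(group[k] for group in references) + "\n")
        ref_paths.append(path)
    return hyp_path, ref_paths


def evaluate_files(hyp_path, ref_paths):
    # Lazy import: requires nlg-eval to be installed and set up.
    from nlgeval import compute_metrics
    return compute_metrics(hypothesis=hyp_path, references=ref_paths)
```

Calling evaluate_files on the returned paths should give back a dictionary of metric names and scores, assuming the setup step has already downloaded the required data.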

The interesting bit

The embedding metrics (SkipThought cosine similarity, GloVe-based embedding average and vector extrema, and greedy matching) are the less common ones. They try to capture semantic similarity beyond n-gram overlap, which helps when a dialogue system paraphrases the reference instead of repeating it. The README is admirably blunt about CIDEr’s IDF gotcha and METEOR’s memory tuning, both signs of a tool that has seen real use.
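
To make the word-embedding metrics more concrete, here is a minimal pure-Python sketch of embedding average, vector extrema, and greedy matching. It is not nlg-eval's implementation: the three-dimensional TOY_VECTORS table is invented for illustration (the real tool uses GloVe vectors), and all function names are assumptions.

```python
import math

# Toy 3-d "embeddings", invented for illustration only.
TOY_VECTORS = {
    "hello": [1.0, 0.0, 0.0],
    "hi": [0.9, 0.1, 0.0],
    "there": [0.0, 1.0, 0.0],
    "friend": [0.0, 0.2, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed(sentence):
    # Words missing from the table are skipped.
    return [TOY_VECTORS[w] for w in sentence.split() if w in TOY_VECTORS]


def average_vector(vectors):
    # Mean of the word vectors, dimension by dimension.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def extrema_vector(vectors):
    # Per dimension, keep the value with the largest magnitude.
    return [max(col, key=abs) for col in zip(*vectors)]


def greedy_match(hyp_vecs, ref_vecs):
    # Match each word to its most similar word on the other side,
    # average those similarities, and do so in both directions.
    def one_way(src, tgt):
        return sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)

    return (one_way(hyp_vecs, ref_vecs) + one_way(ref_vecs, hyp_vecs)) / 2


def embedding_scores(hypothesis, reference):
    h, r = embed(hypothesis), embed(reference)
    if not h or not r:
        return {"average": 0.0, "extrema": 0.0, "greedy": 0.0}
    return {
        "average": cosine(average_vector(h), average_vector(r)),
        "extrema": cosine(extrema_vector(h), extrema_vector(r)),
        "greedy": greedy_match(h, r),
    }
```

With these toy vectors, embedding_scores("hi there", "hello there") scores far higher than embedding_scores("hi there", "friend"), even though "hi" and "hello" never match as n-grams.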

Key highlights

Nine metrics in one call, including four embedding-based similarity scores
OO API caches loaded models for repeated evaluation in long-running scripts
NLGEVAL_DATA environment variable supports shared or Docker-mounted data directories
Setup script downloads ~6 GB of models and embeddings automatically
Originated from a 2017 Maluuba paper on task-oriented dialogue evaluation
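
The model-caching and data-directory highlights can be sketched roughly as follows. NLGEval and its compute_individual_metrics method come from the project's README; the no_skipthoughts and no_glove flags, and the assumption that NLGEVAL_DATA is read when the models load, should be checked against the installed version. The helper names and the example path are hypothetical.

```python
import os


def configure_data_dir(path):
    """Point nlg-eval at a shared or mounted data directory.

    Assumption: set this before the evaluator is created.
    """
    os.environ["NLGEVAL_DATA"] = path


def load_evaluator():
    # Lazy import: needs nlg-eval installed and its data downloaded.
    from nlgeval import NLGEval

    # Assumed flags that skip the heavier embedding models.
    return NLGEval(no_skipthoughts=True, no_glove=True)


def score_all(evaluator, examples):
    """Score (references, hypothesis) pairs with one already-loaded evaluator.

    Reusing the same object avoids reloading models for every example.
    """
    return [
        evaluator.compute_individual_metrics(refs, hyp)
        for refs, hyp in examples
    ]
```

A long-running script would call configure_data_dir once, build the evaluator once with load_evaluator, and then pass it to score_all as often as needed.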

Caveats

Requires Java 1.8 or later; on recent macOS versions, multithreading also needs extra configuration
Windows setup can be fiddly; the nlg-eval script may not land on PATH
CIDEr’s default “corpus” IDF mode returns zero for single-example datasets; the README points you to external patches rather than fixing it
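
The CIDEr caveat follows from how corpus-mode IDF is computed. The sketch below is a simplified unigram illustration, not the library's code: it assumes the common formulation in which IDF is log(number of examples) minus log(document frequency), with each example's reference set counted as one document. All function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(sentence, n=1):
    tokens = sentence.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def make_idf(reference_sets, n=1):
    """Build an IDF function from a list of per-example reference lists."""
    num_examples = len(reference_sets)
    doc_freq = Counter()
    for refs in reference_sets:
        # Count each n-gram once per example, however often it appears.
        doc_freq.update({g for ref in refs for g in ngrams(ref, n)})

    def idf(gram):
        # Unseen n-grams get a document frequency floor of 1.
        return math.log(num_examples) - math.log(max(1, doc_freq[gram]))

    return idf


def tfidf_weights(sentence, idf, n=1):
    counts = Counter(ngrams(sentence, n))
    return {gram: count * idf(gram) for gram, count in counts.items()}
```

With a single example, log(1) is 0, so every weight from tfidf_weights is zero and any cosine similarity built on those vectors is zero as well. With two or more examples, n-grams that appear in only some of them receive positive weight.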

Verdict

Grab this if you’re evaluating dialogue systems or captioning models and want semantic metrics without plumbing them yourself. Skip it if you only need BLEU: sacrebleu is lighter and more actively maintained for that single purpose.
