COMET: grading machine translations with neural metrics
A neural framework that scores translation quality—and now explains why your MT output flunked.

What it does
COMET evaluates machine translation quality using pretrained neural models. It scores translations against references (0 to 1 scale, where 1 is perfect), runs reference-free evaluation when you lack gold standards, and can rank multiple systems with statistical significance testing. The CLI handles everything from single-file scoring to WMT benchmark evaluation via SacreBLEU integration.
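For a feel of what reference-based scoring looks like from Python, here is a minimal sketch. It assumes the `unbabel-comet` package is installed and uses the `Unbabel/wmt22-comet-da` checkpoint. The helper names `build_samples` and `score_with_comet` are invented for this example, and the COMET import is deferred so the data-preparation part runs even without the package.

```python
from typing import Dict, List


def build_samples(sources: List[str], translations: List[str],
                  references: List[str]) -> List[Dict[str, str]]:
    """Zip parallel lists into the src/mt/ref dicts that COMET models consume."""
    if not (len(sources) == len(translations) == len(references)):
        raise ValueError("sources, translations and references must align")
    return [
        {"src": s, "mt": m, "ref": r}
        for s, m, r in zip(sources, translations, references)
    ]


def score_with_comet(samples: List[Dict[str, str]],
                     model_name: str = "Unbabel/wmt22-comet-da",
                     gpus: int = 0) -> float:
    """Download a checkpoint and return its system-level score.

    Assumes `pip install unbabel-comet`; gpus=0 means CPU-only scoring.
    """
    from comet import download_model, load_from_checkpoint

    model = load_from_checkpoint(download_model(model_name))
    output = model.predict(samples, batch_size=8, gpus=gpus)
    return output.system_score


samples = build_samples(
    ["Dem Feuer konnte Einhalt geboten werden"],
    ["The fire could be stopped"],
    ["They were able to control the fire."],
)
print(samples[0])
# score_with_comet(samples) would download the model on first use.
```

For a reference-free model, the same idea should apply with the "ref" key left out of each dict.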
The interesting bit
The newer XCOMET models don’t just spit out numbers: they identify error spans and classify them as minor, major, or critical following the MQM typology, with free-text explanations. There’s also DocCOMET for document-level evaluation using context, which helps with discourse phenomena and chat translation quality. The 10.7B-parameter XCOMET-XXL is currently the model that correlates best with human MQM judgments.
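To make the span-plus-severity idea concrete, the sketch below post-processes a list of error spans for one segment. It is an illustration under assumptions, not COMET's own code: it assumes each span is a dict with `start`, `end` (character offsets, end exclusive) and `severity` keys, which you should check against what your installed version returns in `model_output.metadata.error_spans`. The 1/5/10 penalty weights are a common MQM-style convention picked here for illustration, and `summarize_spans` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List

# Illustrative MQM-style penalty weights (an assumption, not read from COMET).
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def summarize_spans(mt: str, spans: List[Dict]) -> Dict:
    """Count spans per severity and compute a weighted penalty for one segment."""
    counts = Counter(span["severity"] for span in spans)
    penalty = sum(SEVERITY_WEIGHTS.get(sev, 0) * n for sev, n in counts.items())
    # Recover the flagged text from character offsets (end assumed exclusive).
    flagged = [mt[span["start"]:span["end"]] for span in spans]
    return {"counts": dict(counts), "penalty": penalty, "flagged": flagged}


mt = "The fire could be stoped yesterday"
# Hand-written spans in the assumed format; real ones would come from the model.
spans = [
    {"start": 18, "end": 24, "severity": "minor"},
    {"start": 25, "end": 34, "severity": "major"},
]
summary = summarize_spans(mt, spans)
print(summary)
```

A summary like this is handy for triage: sort segments by penalty and read the critical and major spans first.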
Key highlights
- Reference-based and reference-free models available; reference-free built on InfoXLM, reference-based on XLM-R
- XCOMET-XL/XXL models provide explainable error analysis with span detection and severity classification
- Document-level extension (DocCOMET) uses context separators (</s>) for discourse-aware scoring
- Minimum Bayes Risk decoding via comet-mbr to select the best translation from candidate lists
- Statistical significance testing via comet-compare with paired t-test and bootstrap resampling
- Some models require Hugging Face Hub license acknowledgment and login
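Two of the highlights above name algorithms that are easy to sketch in plain Python. The block below is a generic illustration, not the code behind comet-mbr or comet-compare. `mbr_select` picks the candidate with the highest average utility when the other candidates act as pseudo-references, and `token_overlap` is a toy stand-in for the neural metric. `paired_bootstrap` resamples segment indices with replacement and reports how often system A's total beats system B's. All function names are made up for this example.

```python
import random
from typing import Callable, List, Sequence


def mbr_select(candidates: Sequence[str],
               utility: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest average utility against the others."""
    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / max(len(others), 1)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


def token_overlap(hyp: str, pseudo_ref: str) -> float:
    """Toy utility (Jaccard over tokens); a real setup would call a COMET model."""
    a, b = set(hyp.split()), set(pseudo_ref.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def paired_bootstrap(scores_a: List[float], scores_b: List[float],
                     n_resamples: int = 1000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which system A's total beats B's."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Sample segment indices with replacement, keeping A/B scores paired.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples


candidates = ["the cat sat on the mat", "the cat sat on a mat", "a dog ran away"]
best = mbr_select(candidates, token_overlap)
print(best)

# Made-up segment scores where A beats B on every segment.
win_rate = paired_bootstrap([0.9, 0.8, 0.85, 0.7], [0.6, 0.5, 0.55, 0.4])
print(win_rate)
```

Note the cost of MBR: every candidate is scored against every other candidate, so the number of metric calls grows quadratically with the size of the candidate list, which matters a lot when the utility is a multi-billion-parameter model.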
Caveats
- Pre-2022 papers used different model checkpoints; comparing across COMET versions needs care
- License terms vary by model—check LICENSE.models before using in production
- The 3.5B and 10.7B parameter models are hefty; CPU-only scoring will hurt
Verdict
MT researchers and engineers who need more than BLEU should grab this. If you’re not doing translation evaluation, or if you need lightweight, fast metrics without GPU overhead, look elsewhere.