Tiiiger/bert_score

BLEU is blind to meaning. BERTScore isn't.

A reference metric that uses contextual embeddings to judge whether your generated text actually says the same thing as the reference.

1.9k stars Jupyter Notebook LLMOps · Eval

What it does

BERTScore evaluates text generation by greedily matching each token in the candidate sentence to its most similar token in the reference (and vice versa), using cosine similarity between contextual BERT embeddings. It outputs precision, recall, and F1: the familiar shapes of classic overlap metrics, but driven by semantic similarity rather than exact n-gram matches. The project supports ~130 Hugging Face models and the 104 languages covered by multilingual BERT.
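
To make the matching step concrete, here is a minimal sketch in plain Python. It assumes the token embeddings are already computed (the toy 2-D vectors below are made up for illustration) and leaves out the optional IDF weighting; `cosine` and `greedy_prf` are hypothetical helper names, not functions from the bert_score package.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_prf(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # sim[i][j] = similarity of candidate token i to reference token j
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(row) for row in sim) / len(cand_vecs)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_vecs)))
        for j in range(len(ref_vecs))
    ) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy embeddings: two candidate tokens, three reference tokens.
    candidate = [[1.0, 0.0], [0.0, 1.0]]
    reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    p, r, f = greedy_prf(candidate, reference)
    print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

In this toy case every candidate token has an exact match, so precision is perfect, while the third reference token is only partially covered and pulls recall down.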

The interesting bit

The authors maintain a public spreadsheet tracking which models correlate best with human judgment. Currently microsoft/deberta-xlarge-mnli leads, not the default roberta-large. That kind of empirical honesty is rarer than it should be in evaluation metrics. They also provide a --rescale_with_baseline flag that spreads raw scores, which tend to cluster in a narrow high band, over a more human-readable range, plus a CLI tool that can visualize token-level matching heatmaps.
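
The rescaling itself is simple arithmetic. A minimal sketch, assuming the linear form (score - baseline) / (1 - baseline) described in the repository's rescaling notes; the baseline value used here is invented for illustration, and `rescale` is a hypothetical helper rather than a library function.

```python
def rescale(score, baseline):
    """Linearly map a raw score so that `baseline` becomes 0 and 1.0 stays 1.0."""
    return (score - baseline) / (1.0 - baseline)


if __name__ == "__main__":
    # Illustrative numbers only: real baselines depend on the model, layer, and language.
    baseline = 0.84
    for raw in (0.84, 0.92, 1.0):
        print(f"raw={raw:.2f} -> rescaled={rescale(raw, baseline):.2f}")
```

Because the map is linear and increasing, it changes how scores read but not how candidates rank against each other.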

Key highlights

  • Python API (bert_score.score) and cached BERTScorer object for repeated evaluations
  • CLI with multi-reference support and custom model loading via --model and --num_layers
  • Baseline rescaling to widen score ranges; Google Sheet tracking model-human correlations
  • GPU recommended; Google Colab demo provided for the compute-constrained
  • Integrated into Hugging Face’s datasets library as a built-in metric
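
For the Python API listed above, a minimal usage sketch might look like the following. `score_pairs` is a hypothetical wrapper written for this example; `BERTScorer`, its `lang` and `rescale_with_baseline` arguments, and its `score` method come from the bert_score package. Expect the first run to download model weights, and expect it to be slow without a GPU.

```python
def score_pairs(candidates, references, lang="en"):
    """Hypothetical helper: return (P, R, F1) as lists, one value per pair."""
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    # Imported lazily so the length check works without the package installed.
    from bert_score import BERTScorer

    # Building the scorer loads the model; in real code, build it once and reuse it.
    scorer = BERTScorer(lang=lang, rescale_with_baseline=True)
    precision, recall, f1 = scorer.score(candidates, references)
    return precision.tolist(), recall.tolist(), f1.tolist()


if __name__ == "__main__":
    cands = ["The cat sat on the mat."]
    refs = ["A cat was sitting on the mat."]
    p, r, f = score_pairs(cands, refs)
    print(f"P={p[0]:.3f} R={r[0]:.3f} F1={f[0]:.3f}")
```

Scoring many batches with one long-lived scorer avoids reloading the model each time, which is the point of the cached object.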

Caveats

  • Computationally expensive enough that the README explicitly warns “a GPU is usually necessary”
  • Fast tokenizers produce different scores than standard ones, which is documented but easy to miss
  • Default model (roberta-large) is not the best-performing; you need to opt into deberta-xlarge-mnli for best correlation with human judgment

Verdict

Worth adopting if you’re still using BLEU or ROUGE for anything involving paraphrase, summarization, or translation. Skip it if you need a lightweight metric for real-time feedback loops — the BERT tax is real.
