facebookresearch/SentEval

The standardized torture test for sentence embeddings

Facebook Research's evaluation toolkit runs your sentence vectors through 17 downstream tasks and 10 probing tasks to see what they actually learned.

2.1k stars · Python · LLMOps · Eval · Language Models
Velocity (7d): +0.6 ★/day · Trend: steady

What it does

SentEval is a Python evaluation framework for fixed-size sentence embeddings. You implement two functions, prepare (optional) and batcher, and SentEval uses the vectors they produce as fixed features, training a small classifier on top of them for each classification, similarity, and inference task. The classifier can run on either a PyTorch or a scikit-learn backend.
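A minimal sketch of the two callbacks, under some assumptions: the hashed bag-of-words `embed` function, `DIM`, and the `run` wrapper are hypothetical stand-ins rather than part of SentEval, and the `senteval.engine.SE(params, batcher, prepare)` call with the `MR` and `SST2` task names follows the usage pattern in the project's README, so it is worth checking against the repository before relying on it.

```python
import zlib

try:
    import numpy as np  # SentEval itself depends on NumPy
except ImportError:     # keeps this sketch importable without it
    np = None

DIM = 64  # size of the toy embedding (arbitrary)


def prepare(params, samples):
    """Optional hook: called with the sentences of a task before batching.

    Each sample is a list of tokens. This toy version only records the
    vocabulary size; a real encoder might build a word-vector lookup here.
    """
    params["vocab_size"] = len({tok for sent in samples for tok in sent})


def embed(tokens):
    """Hypothetical encoder: average of hashed one-hot word vectors."""
    vec = [0.0] * DIM
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    return [v / len(tokens) for v in vec]


def batcher(params, batch):
    """Required hook: turn a batch of tokenized sentences into one row each."""
    batch = [sent if sent else ["."] for sent in batch]  # guard empty inputs
    rows = [embed(sent) for sent in batch]
    return np.asarray(rows, dtype="float32") if np is not None else rows


def run(task_path):
    """Evaluate on two downstream tasks; needs SentEval and its datasets."""
    import senteval
    params = {"task_path": task_path, "usepytorch": False, "kfold": 10}
    se = senteval.engine.SE(params, batcher, prepare)
    return se.eval(["MR", "SST2"])
```

Swapping in a real encoder should only require replacing `embed`; the shape of the contract (a list of tokenized sentences in, one fixed-size vector per sentence out) stays the same.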

The interesting bit

The probing tasks are the twist. Beyond standard benchmarks like SST and SNLI, SentEval checks whether your embeddings implicitly encode properties such as syntax tree depth, verb tense, word order, and the grammatical number of the subject and object. In effect, it asks whether your model learned the structure of English or just memorized surface patterns.
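For orientation, the ten probing tasks can be written down as a small lookup table. The task identifiers below are the ones used in the repository's example scripts as best recalled; the one-line descriptions are paraphrases, and `probing_task_names` is a hypothetical helper, not a SentEval function.

```python
# Probing task identifier -> the property a classifier must recover
# from the sentence embedding alone (descriptions are paraphrased).
PROBING_TASKS = {
    "Length": "sentence length, binned",
    "WordContent": "which word from a fixed list appears in the sentence",
    "Depth": "depth of the syntactic parse tree",
    "TopConstituents": "sequence of top-level constituents",
    "BigramShift": "whether two adjacent words were swapped",
    "Tense": "tense of the main-clause verb",
    "SubjNumber": "grammatical number of the subject",
    "ObjNumber": "grammatical number of the direct object",
    "OddManOut": "whether a noun or verb was replaced by another word",
    "CoordinationInversion": "whether coordinated clauses were swapped",
}


def probing_task_names():
    """Hypothetical helper: task names in a form suitable for an eval call."""
    return list(PROBING_TASKS)
```

The resulting list is the kind of argument that would be passed to the engine's `eval` method alongside, or instead of, the downstream task names.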

Key highlights

  • 17 downstream transfer tasks (sentiment, NLI, paraphrase, STS, image-caption retrieval)
  • 10 linguistic probing tasks for diagnosing what embeddings actually capture
  • Example scripts for InferSent, SkipThought-LN, GenSen, and Google’s Universal Sentence Encoder
  • Configurable classifier: logistic regression or MLP with tunable hyperparameters
  • “Prototyping config” option that trades accuracy for roughly 5× speedup
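The classifier and prototyping options above might be expressed as a parameter dictionary along these lines. The specific values (fold counts, optimizer names, batch sizes, `tenacity`, `epoch_size`) are recalled from the project's README and should be treated as assumptions to verify; `make_params` is a hypothetical helper introduced only for this sketch.

```python
def make_params(task_path, prototyping=False, nhid=0):
    """Build a SentEval-style params dict.

    nhid=0 selects logistic regression; nhid>0 selects an MLP with that
    many hidden units. prototyping=True uses the faster, less tuned setup.
    """
    if prototyping:
        # Faster setting: fewer folds, fewer epochs, earlier stopping.
        kfold = 5
        classifier = {"nhid": nhid, "optim": "rmsprop", "batch_size": 128,
                      "tenacity": 3, "epoch_size": 2}
    else:
        # Slower default intended for results comparable to the literature.
        kfold = 10
        classifier = {"nhid": nhid, "optim": "adam", "batch_size": 64,
                      "tenacity": 5, "epoch_size": 4}
    return {"task_path": task_path, "usepytorch": True,
            "kfold": kfold, "classifier": classifier}


fast = make_params("path/to/data", prototyping=True)        # quick iteration
final = make_params("path/to/data", prototyping=False)      # reported numbers
```

A reasonable workflow is to iterate with the prototyping setting and switch to the default only for numbers meant to be compared with published results.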

Caveats

  • Python 2/3 compatibility and PyTorch ≥0.4 requirement suggest the codebase has not been refreshed recently
  • Some encoder examples (SkipThought, GenSen) require external setup and dependencies
  • macOS users may need to swap unzip for p7zip when downloading datasets

Verdict

Essential if you’re training or comparing sentence encoders and want more than a single leaderboard number. Skip it if you’re working with contextualized embeddings (BERT-style) or need modern, actively maintained infrastructure.
