The standardized torture test for sentence embeddings
Facebook Research's evaluation toolkit runs your sentence vectors through 17 downstream tasks and 10 probing tasks to see what they actually learned.

What it does
SentEval is a Python evaluation framework for fixed-size sentence embeddings. You implement two functions—prepare (optional) and batcher—and it handles the rest, testing your embeddings as features across classification, similarity, and inference tasks. It supports both PyTorch and scikit-learn backends.
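As a rough illustration of that two-function contract, here is a minimal sketch of a toy encoder. It assumes SentEval hands both functions tokenized sentences (lists of words) and a params object with attribute access, as the project's bag-of-words example does. The hashed random word vectors and the names DIM, word_vector, and run_senteval are hypothetical, invented for this sketch, and NumPy is assumed to be installed.

```python
import zlib
import numpy as np

DIM = 64  # hypothetical embedding size for this toy encoder


def word_vector(word, dim=DIM):
    """Deterministic pseudo-random vector for a word (a stand-in for real embeddings)."""
    seed = zlib.crc32(word.encode("utf-8"))
    return np.random.RandomState(seed).standard_normal(dim)


def prepare(params, samples):
    """Called once per task with every sentence; builds a word-vector lookup."""
    vocab = {word for sentence in samples for word in sentence}
    params.word_vec = {word: word_vector(word) for word in vocab}


def batcher(params, batch):
    """Turn a list of tokenized sentences into a (len(batch), DIM) array."""
    rows = []
    for sentence in batch:
        tokens = sentence if sentence else ["."]  # guard against empty sentences
        # Unknown tokens fall back to a zero vector.
        vectors = [params.word_vec.get(t, np.zeros(DIM)) for t in tokens]
        rows.append(np.mean(vectors, axis=0))
    return np.vstack(rows)


def run_senteval(task_path, tasks=("MR", "SST2", "STS14")):
    """Assumed entry point, following the usage shown in the SentEval README."""
    import senteval  # imported lazily so the encoder above works without it

    params = {"task_path": task_path, "usepytorch": False, "kfold": 5}
    engine = senteval.engine.SE(params, batcher, prepare)
    return engine.eval(list(tasks))
```

Swapping in a real model should only mean replacing the body of batcher. Everything else stays the same, which is most of the toolkit's appeal.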
The interesting bit
The probing tasks are the twist. Beyond standard benchmarks like SST and SNLI, SentEval checks whether your embeddings implicitly encode syntax tree depth, verb tense, word order, and the grammatical number of subjects and objects. In effect, it asks whether your model learned English or just memorized surface patterns.
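For orientation, the ten probing tasks can be listed next to the property each one targets. The identifiers below are believed to match the names used in SentEval's example scripts, and the descriptions are paraphrases rather than official wording, so treat both as assumptions to check against the repository.

```python
# Probing task identifiers (assumed to match SentEval's example scripts),
# each mapped to a paraphrased description of what it tests.
PROBING_TASKS = {
    "Length": "sentence length, bucketed into bins",
    "WordContent": "which word from a fixed set appears in the sentence",
    "Depth": "depth of the syntax tree",
    "TopConstituents": "sequence of top-level constituents",
    "BigramShift": "whether two adjacent words were swapped",
    "Tense": "tense of the main verb",
    "SubjNumber": "singular vs. plural subject",
    "ObjNumber": "singular vs. plural direct object",
    "OddManOut": "whether a noun or verb was replaced by an unlikely one",
    "CoordinationInversion": "whether two coordinated clauses were swapped",
}


def probing_task_names():
    """Return the task names in a form that could be passed to an evaluation run."""
    return list(PROBING_TASKS)
```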
Key highlights
- 17 downstream transfer tasks (sentiment, NLI, paraphrase, STS, image-caption retrieval)
- 10 linguistic probing tasks for diagnosing what embeddings actually capture
- Example scripts for InferSent, SkipThought-LN, GenSen, and Google’s Universal Sentence Encoder
- Configurable classifier: logistic regression or MLP with tunable hyperparameters
- “Prototyping config” option that trades accuracy for roughly 5× speedup
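The classifier and prototyping options in the list above come down to a handful of dictionary entries. The sketch below shows how the two setups might be expressed. The keys and values are recalled from the SentEval README and should be treated as assumptions to verify there; make_params is a hypothetical helper, not part of the library. A hidden size (nhid) of 0 is understood to select logistic regression, and a positive value an MLP.

```python
def make_params(task_path, prototyping=False, use_pytorch=True):
    """Build a SentEval-style params dict (values assumed from the README)."""
    if prototyping:
        # Faster, less accurate: fewer folds, larger batches, earlier stopping.
        classifier = {"nhid": 0, "optim": "rmsprop", "batch_size": 128,
                      "tenacity": 3, "epoch_size": 2}
        kfold = 5
    else:
        # Slower default intended for reportable results.
        classifier = {"nhid": 0, "optim": "adam", "batch_size": 64,
                      "tenacity": 5, "epoch_size": 4}
        kfold = 10
    return {"task_path": task_path, "usepytorch": use_pytorch,
            "kfold": kfold, "classifier": classifier}
```

A reasonable workflow is to iterate with prototyping=True and rerun with the default settings before reporting numbers.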
Caveats
- Python 2/3 compatibility and a PyTorch ≥0.4 requirement suggest the codebase has not been refreshed recently
- Some encoder examples (SkipThought, GenSen) require external setup and dependencies
- macOS users may need to swap unzip for p7zip when downloading datasets
Verdict
Essential if you’re training or comparing sentence encoders and want more than a single leaderboard number. Skip it if you’re working with contextualized embeddings (BERT-style) or need modern, actively maintained infrastructure.