How much does BERT actually know? This probe finds out.
LAMA is a standardized benchmark for extracting and comparing factual knowledge across pretrained language models.

What it does
LAMA provides a consistent interface for testing whether pretrained language models (BERT, GPT, RoBERTa, ELMo, Transformer-XL) contain factual and commonsense knowledge. It uses cloze-style probes (fill-in-the-[MASK]) to see whether a model can complete statements like “The theory of relativity was developed by [MASK].” The package also lets you encode sentences as embeddings and compare model outputs side by side.
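To make the cloze idea concrete, here is a minimal sketch of a single probe. It is not LAMA's own code: the `[X]`/`[Y]` template slots, the function names, and the `toy_fill_mask` stand-in (with made-up scores) are all assumptions for illustration.

```python
from typing import Callable, Dict, List

MASK = "[MASK]"  # BERT-style mask symbol (assumption; other models use other symbols)


def build_probe(template: str, subject: str) -> str:
    """Fill the subject slot [X] and turn the object slot [Y] into one mask."""
    return template.replace("[X]", subject).replace("[Y]", MASK)


def probe_hits(
    fill_mask: Callable[..., List[Dict]],
    template: str,
    subject: str,
    gold: str,
    k: int = 1,
) -> bool:
    """Return True if the gold answer is among the model's top-k fillers."""
    statement = build_probe(template, subject)
    predictions = fill_mask(statement, top_k=k)
    return gold in [p["token_str"].strip() for p in predictions]


def toy_fill_mask(statement: str, top_k: int = 1) -> List[Dict]:
    """Hypothetical stand-in for a real masked LM; the scores are made up."""
    ranked = [
        {"token_str": "Einstein", "score": 0.90},
        {"token_str": "Newton", "score": 0.05},
    ]
    return ranked[:top_k]


TEMPLATE = "[X] was developed by [Y]."
print(build_probe(TEMPLATE, "The theory of relativity"))
# The theory of relativity was developed by [MASK].
print(probe_hits(toy_fill_mask, TEMPLATE, "The theory of relativity", "Einstein"))
# True
```

Any callable that takes a masked statement plus `top_k` and returns dicts with a `token_str` key should slot in where `toy_fill_mask` is used. A Hugging Face `transformers` fill-mask pipeline is expected to behave that way for single-mask inputs, though that depends on the installed version.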
The interesting bit
The project treats language models as implicit knowledge bases and asks: can we query them like one? It ships with a common vocabulary, the intersection of the vocabularies of all supported models, so comparisons are less confounded by tokenization differences. The “Negated-LAMA” variant tests whether models handle negation (spoiler: often poorly).
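The vocabulary-intersection idea can be sketched in a few lines. This is an illustration under assumptions, not LAMA's implementation: the helper names, the toy vocabularies, and the scores are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def common_vocab(vocabularies: Iterable[Iterable[str]]) -> Set[str]:
    """Tokens that every model's vocabulary contains."""
    sets = [set(v) for v in vocabularies]
    return set.intersection(*sets) if sets else set()


def rank_within(scores: Dict[str, float], allowed: Set[str]) -> List[str]:
    """Rank a model's candidate tokens, ignoring any outside the shared set."""
    candidates = [tok for tok in scores if tok in allowed]
    return sorted(candidates, key=lambda tok: scores[tok], reverse=True)


# Toy vocabularies for two hypothetical models.
model_a_vocab = ["Paris", "London", "Rome", "Lutetia"]
model_b_vocab = ["Paris", "London", "Rome", "Berlin"]
shared = common_vocab([model_a_vocab, model_b_vocab])

# Made-up scores: model A's raw top answer is a token model B cannot produce.
model_a_scores = {"Lutetia": 0.5, "Paris": 0.3, "London": 0.1, "Rome": 0.05}
print(rank_within(model_a_scores, shared))
# ['Paris', 'London', 'Rome']
```

Restricting every model to the same candidate set means a model is neither rewarded nor penalized for tokens the others cannot predict at all.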
Key highlights
- Supports five major model families through a single command-line interface
- Includes pre-built datasets and a script that downloads the models (~55 GB)
- Can encode sentences for downstream tasks or run interactive [MASK] completion
- Provides unified cased/lowercased vocabularies for fair cross-model comparison
- Extensible to negated probes and LAMA-UHN variants for harder evaluation
Caveats
- Requires significant disk space (~55 GB) and manual model setup
- Single-token [MASK] gaps only; multi-word answers are out of scope
- Code targets Python 3.7 and older model versions; it may need tweaks to work with current releases of the transformers library
- CC-BY-NC 4.0 license restricts commercial use
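The single-token caveat above has a practical consequence: facts whose answer is not a single vocabulary entry have to be dropped before probing. A minimal sketch, assuming a hypothetical fact format and a toy vocabulary (neither is taken from LAMA's data files):

```python
from typing import Dict, List, Set


def is_single_token(answer: str, vocab: Set[str]) -> bool:
    """True when the whole answer string is exactly one vocabulary entry."""
    return answer in vocab


def keep_probeable(facts: List[Dict[str, str]], vocab: Set[str]) -> List[Dict[str, str]]:
    """Keep only facts whose answer can fill a single [MASK] slot."""
    return [fact for fact in facts if is_single_token(fact["answer"], vocab)]


# Toy vocabulary: "New" and "Orleans" exist separately, "New Orleans" does not.
vocab = {"Florence", "New", "Orleans", "Paris"}
facts = [
    {"statement": "Dante was born in [MASK].", "answer": "Florence"},
    {"statement": "Louis Armstrong was born in [MASK].", "answer": "New Orleans"},
]
print([fact["answer"] for fact in keep_probeable(facts, vocab)])
# ['Florence']
```

Multi-word answers such as "New Orleans" are filtered out here even though a model may well know the fact, which is worth keeping in mind when reading the scores.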
Verdict
Researchers studying knowledge extraction or model comparison should grab this. If you just need a quick BERT inference snippet, Hugging Face pipelines are lighter.