stanford-crfm/helm
An open-source Python framework by Stanford CRFM for holistic, reproducible evaluation of foundation models including LLMs and multimodal models.

Velocity (7d): +1.7 ★/day · Trend: steady
HELM provides a standardized framework for evaluating language models and multimodal systems. It packages curated datasets and benchmarks, including MMLU-Pro, GPQA, IFEval, and WildBench, in a common format, and it supports models from multiple providers, so that foundation model capabilities can be benchmarked transparently and reproducibly.
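To illustrate why a common format matters, here is a minimal sketch of the general idea, not HELM's actual API or schema. The `Instance` class, the `exact_match_accuracy` function, the `toy-arithmetic` benchmark, and the canned model are all hypothetical names and data invented for this example. The point is that once every benchmark item has the same shape, any model exposed as a prompt-to-text callable can be scored by the same code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Instance:
    """One benchmark item in a shared shape (hypothetical, not HELM's schema)."""
    benchmark: str
    prompt: str
    reference: str


def exact_match_accuracy(
    model: Callable[[str], str], instances: Iterable[Instance]
) -> float:
    """Fraction of instances whose output equals the reference after trimming."""
    items = list(instances)
    if not items:
        return 0.0
    correct = sum(
        model(item.prompt).strip() == item.reference.strip() for item in items
    )
    return correct / len(items)


# Toy data and a stand-in "model"; both are made up for illustration.
instances = [
    Instance("toy-arithmetic", "2 + 2 =", "4"),
    Instance("toy-arithmetic", "3 + 5 =", "8"),
]
canned_answers = {"2 + 2 =": "4", "3 + 5 =": "9"}


def toy_model(prompt: str) -> str:
    """Stand-in for a provider client: returns a canned answer per prompt."""
    return canned_answers.get(prompt, "")


score = exact_match_accuracy(toy_model, instances)
print(f"exact-match accuracy: {score:.2f}")
```

Swapping `toy_model` for a function that calls a different provider would leave the scoring code unchanged, which is the property that makes results comparable across models.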