
stanford-crfm/helm

An open-source Python framework by Stanford CRFM for holistic, reproducible evaluation of foundation models including LLMs and multimodal models.

2.8k stars · Python · LLMOps · Eval · Data Tooling
Velocity (7d): +1.7 ★/day · Trend: steady

HELM provides a standardized framework for evaluating language models and multimodal systems. It packages curated datasets and benchmarks, such as MMLU-Pro, GPQA, IFEval, and WildBench, in a common format. It supports models from multiple providers, so foundation model capabilities can be benchmarked transparently and reproducibly.