google-deepmind/long-form-factuality

A benchmark suite for measuring factual accuracy in long-form responses from large language models.

686 stars Python LLMOps · EvalData Tooling

LongForm Factuality provides tools for evaluating how accurately large language models generate factual information in extended responses. It includes LongFact, a dataset of 2,280 fact-seeking prompts, and SAFE (Search-Augmented Factuality Evaluator), an automated evaluation system that splits a response into individual facts and checks each one against Google Search results. The repository also introduces F1@K, an extension of the F1 score to long-form settings that balances precision (the share of a response's facts that are supported) with recall measured against K, a preferred number of supported facts. It also provides a pipeline for benchmarking models from OpenAI and Anthropic.
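As a rough illustration of how F1@K combines the two quantities, here is a minimal Python sketch. It assumes that the supported and not-supported facts in a response have already been counted (for example, by an evaluator such as SAFE). The function name `f1_at_k` and its parameters are hypothetical and are not taken from the repository's code.

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """Sketch of F1@K from per-response fact counts (illustrative names).

    supported:     number of facts in the response judged supported
    not_supported: number of facts in the response judged not supported
    k:             preferred number of supported facts (the K in F1@K)
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if supported < 0 or not_supported < 0:
        raise ValueError("fact counts must be non-negative")

    # A response with no supported facts scores zero.
    if supported == 0:
        return 0.0

    # Precision: share of the response's rated facts that are supported.
    precision = supported / (supported + not_supported)
    # Recall: supported facts relative to K, capped at 1.
    recall = min(supported / k, 1.0)

    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(f1_at_k(supported=32, not_supported=32, k=64))
```

Capping recall at 1 means a response gains nothing from listing more than K supported facts, while any unsupported facts still lower its precision.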
