google-deepmind/long-form-factuality
A benchmark suite for measuring factual accuracy in long-form responses from large language models.

LongForm Factuality provides tools for evaluating how accurately large language models state facts in extended responses. It includes LongFact, a dataset of 2,280 fact-seeking prompts, and SAFE (Search-Augmented Factuality Evaluator), an automated evaluator that splits a response into individual facts and checks each one against search results. The repository also introduces F1@K, a metric that extends F1 to long-form settings by combining factual precision with recall measured up to K supported facts, and provides a pipeline for benchmarking models from OpenAI and Anthropic.
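
To make F1@K concrete, here is a minimal sketch. It assumes the usual definition: precision is the share of a response's rated facts that are supported, recall is the number of supported facts divided by K and capped at 1, and the score is their harmonic mean (0 when nothing is supported). The function name `f1_at_k` and its parameters are illustrative and are not taken from the repository's code.

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Illustrative F1@K for one response (hypothetical helper, not the repo's API).

    num_supported: facts in the response rated as supported.
    num_not_supported: facts in the response rated as not supported.
    k: number of supported facts at which recall is treated as complete.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if num_supported < 0 or num_not_supported < 0:
        raise ValueError("fact counts must be non-negative")

    # A response with no supported facts scores zero by convention.
    if num_supported == 0:
        return 0.0

    # Precision: fraction of rated facts that are supported.
    precision = num_supported / (num_supported + num_not_supported)
    # Recall: supported facts relative to K, capped at 1.
    recall = min(num_supported / k, 1.0)

    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Example: 40 supported and 10 unsupported facts, evaluated at K = 64.
    print(f1_at_k(num_supported=40, num_not_supported=10, k=64))
```

Because recall is capped, a response gains nothing from supported facts beyond K, while every unsupported fact still lowers precision. The choice of K therefore encodes how long a response the evaluator wants.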