THUDM/LongBench

A benchmark suite for evaluating LLMs on challenging long-context tasks requiring deep understanding and reasoning across single/multi-document QA, in-context learning, dialogue understanding, and code comprehension.

1.2k stars Python LLMOps · EvalLearning
Velocity (7d): +1.1 ★ / day · Trend: steady

LongBench v2 is an evaluation benchmark for assessing large language models on realistic long-context tasks. It contains 503 multiple-choice questions with context lengths ranging from 8k to 2M words, covering six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repo understanding, and long structured data understanding. The questions are designed to be hard enough that even human experts using search tools cannot answer them correctly in a short time, so scores reflect deep reasoning over the full context rather than surface retrieval.
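Because every question is multiple-choice, scoring reduces to comparing a predicted option letter with the reference letter and reporting accuracy per task category. The following is a minimal sketch of that idea, not the repository's own evaluation script: the record fields `category`, `answer`, and `prediction` are hypothetical names chosen for illustration, and the sample records are made up.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy}, plus an aggregate 'overall' entry.

    Each record is a dict with hypothetical keys:
      'category'   - task category name
      'answer'     - reference option letter, e.g. 'A'
      'prediction' - option letter produced by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        # Compare letters case-insensitively, ignoring stray whitespace.
        hit = rec["prediction"].strip().upper() == rec["answer"].strip().upper()
        for key in (rec["category"], "overall"):
            total[key] += 1
            correct[key] += int(hit)
    return {key: correct[key] / total[key] for key in total}


# Made-up records, for illustration only.
sample = [
    {"category": "single-document QA", "answer": "A", "prediction": "A"},
    {"category": "single-document QA", "answer": "C", "prediction": "B"},
    {"category": "code repo understanding", "answer": "D", "prediction": "d "},
]

scores = accuracy_by_category(sample)
for name, acc in scores.items():
    print(f"{name}: {acc:.1%}")
```

Reporting per-category accuracy alongside the overall figure matters here because the six categories differ widely in context length and difficulty, so a single aggregate can hide weak spots.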
