
FranxYao/chain-of-thought-hub

Academic benchmark suite evaluating LLMs on complex reasoning tasks using chain-of-thought prompting across math, science, coding, and knowledge domains.

2.8k stars Jupyter Notebook LLMOps · EvalLearning
Velocity (7d): +2.3 ★ / day · Trend: steady

A research framework that systematically benchmarks large language models across diverse reasoning domains including mathematics, science, symbolic manipulation, coding, and factual accuracy. It provides standardized evaluation datasets and metrics to measure model performance on challenging tasks, with a focus on assessing chain-of-thought reasoning capabilities. The work includes implementations for multiple benchmarks like GSM8K, MATH, TheoremQA, BBH, MMLU, and HumanEval.
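To make the idea of a standardized metric concrete, here is a minimal sketch of how a chain-of-thought completion might be scored on a GSM8K-style math problem: take the last number in the model's reasoning as its final answer and compare it to the reference. This is an illustration only, not the repository's own code; the function names `extract_final_number` and `accuracy`, the regular expression, and the sample completions are all assumptions made for this example.

```python
import re


def extract_final_number(completion: str):
    """Return the last number in a completion as a float, or None if there is none.

    Assumption: the final answer is the last number the model writes,
    a common heuristic for GSM8K-style outputs.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    if not matches:
        return None
    # Strip thousands separators such as "1,250" before converting.
    return float(matches[-1].replace(",", ""))


def accuracy(completions, gold_answers):
    """Fraction of completions whose extracted answer matches the reference."""
    if not gold_answers:
        raise ValueError("gold_answers must not be empty")
    correct = 0
    for completion, gold in zip(completions, gold_answers):
        predicted = extract_final_number(completion)
        if predicted is not None and abs(predicted - gold) < 1e-6:
            correct += 1
    return correct / len(gold_answers)


# Hypothetical model outputs, written by hand for this sketch.
completions = [
    "She has 3 + 4 = 7 apples. The answer is 7.",
    "Total is 1,250 dollars",
    "I don't know",
]
gold_answers = [7, 1250, 5]
score = accuracy(completions, gold_answers)
```

A completion with no number counts as wrong rather than being skipped, so the denominator is always the full set of problems.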
