FranxYao/chain-of-thought-hub
Academic benchmark suite evaluating LLMs on complex reasoning tasks using chain-of-thought prompting across math, science, coding, and knowledge domains.

This research framework benchmarks large language models across reasoning domains: mathematics, science, symbolic manipulation, coding, and factual knowledge. It provides standardized evaluation datasets and metrics for measuring model performance on challenging tasks, with a focus on chain-of-thought reasoning. It includes evaluation implementations for benchmarks such as GSM8K, MATH, TheoremQA, BBH, MMLU, and HumanEval.
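As a rough illustration of what scoring a chain-of-thought response can look like on a math benchmark such as GSM8K, the sketch below takes the last number in the model's free-form reasoning as its final answer and compares it with the gold answer. The function names `extract_final_number` and `accuracy` are hypothetical and are not taken from the repository; the "last number wins" rule is an assumed heuristic, not necessarily the extraction logic the project uses.

```python
import re

# Hypothetical helpers for illustration; not the repository's own code.
# Matches integers and decimals, optionally negative, with thousands commas.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number that appears in `text` as a float, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    # Assumption: the final number in the reasoning is the model's answer.
    return float(matches[-1].replace(",", ""))


def accuracy(responses, gold_answers):
    """Fraction of responses whose extracted answer equals the gold answer."""
    if not responses:
        return 0.0
    correct = 0
    for response, gold in zip(responses, gold_answers):
        predicted = extract_final_number(response)
        if predicted is not None and abs(predicted - gold) < 1e-6:
            correct += 1
    return correct / len(responses)


if __name__ == "__main__":
    responses = [
        "She has 16 - 3 - 4 = 9 eggs left. 9 * 2 = 18. The answer is 18.",
        "I am not sure.",
    ]
    print(accuracy(responses, [18.0, 5.0]))
```

A response with no number at all counts as wrong rather than raising an error, so a single malformed generation does not abort a full evaluation run.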