THUDM/AlignBench
A comprehensive, multi-dimensional benchmark for evaluating the alignment of Chinese large language models using an LLM-as-Judge methodology.

AlignBench is a benchmark for evaluating how well Chinese large language models are aligned. It uses a multi-dimensional, rule-calibrated LLM-as-Judge approach in which the judge applies Chain-of-Thought reasoning to produce a written analysis followed by a final score. The evaluation data is built through a human-in-the-loop construction pipeline so that it can be updated dynamically, and it covers multiple evaluation dimensions to reflect real-world model performance.
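
To make the judging flow concrete, here is a minimal sketch of how a multi-dimensional LLM-as-Judge step could be wired up: build a prompt that asks for an analysis first and scores second, then parse the scores from the judge's reply. This is an illustration only, not AlignBench's actual code. The dimension names, the prompt wording, the 1 to 10 score range, the reply format, and the function names `build_judge_prompt` and `parse_judge_output` are all assumptions made for this example.

```python
import re

# Hypothetical dimension names chosen for illustration; the real
# benchmark's dimensions and scoring rules may differ.
DIMENSIONS = ["factuality", "user_satisfaction", "clarity", "completeness"]


def build_judge_prompt(question, reference, answer, dimensions=DIMENSIONS):
    """Build a judge prompt that requests analysis before any score."""
    score_lines = "\n".join(f"{dim}: <score>" for dim in dimensions)
    return (
        "You are an impartial judge of an AI assistant's answer.\n"
        "First write a step-by-step analysis comparing the answer with the "
        "reference. Then rate each dimension and give an overall score, "
        "each as an integer from 1 to 10.\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Assistant answer: {answer}\n\n"
        "Reply in this format:\n"
        "Analysis: <your reasoning>\n"
        f"{score_lines}\n"
        "overall: <score>\n"
    )


def _extract_score(label, text):
    """Find '<label>: <integer>' in the reply and validate the range."""
    match = re.search(rf"{re.escape(label)}\s*:\s*(\d+)", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"missing score for {label!r}")
    score = int(match.group(1))
    if not 1 <= score <= 10:
        raise ValueError(f"score for {label!r} out of range: {score}")
    return score


def parse_judge_output(text, dimensions=DIMENSIONS):
    """Return (per-dimension scores, overall score) from a judge reply."""
    scores = {dim: _extract_score(dim, text) for dim in dimensions}
    overall = _extract_score("overall", text)
    return scores, overall
```

In use, the string returned by `build_judge_prompt` would be sent to whichever judge model is chosen, and the reply passed to `parse_judge_output`. Requesting the analysis before the scores is what gives the judge room to reason before committing to a number, and rejecting missing or out-of-range scores keeps malformed replies from silently skewing averages.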