llm2014/llm_benchmark
A personal benchmarking system that tracks and evaluates large language models using a rolling private question bank.
★ 1.2k stars · LLMOps · Eval

Velocity (7d): +2.5 ★ / day · Trend: steady
This repository maintains a systematic evaluation framework for LLMs, testing models on logic, mathematics, programming, and reasoning tasks. It uses a private question bank of 28 questions with 270 test cases, updated monthly, and evaluates models through official APIs or proxy services. Each model is tested three times with the highest score recorded, and scores are normalized to a 0-10 scale per question.
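
The scoring rule described above (three runs, best one kept, each question scaled to 0-10) could be sketched roughly as follows. This is only an illustration, not the repository's code: the names `QuestionResult`, `question_score`, `run_score`, and `best_of_runs` are hypothetical, and two details are assumptions the description does not confirm: that a question's score is its fraction of passed test cases times 10, and that "highest score" means the best total across whole runs rather than the best per question.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QuestionResult:
    """Outcome of one question in one run (hypothetical structure)."""
    passed: int  # test cases the model's answer passed
    total: int   # test cases defined for this question


def question_score(result: QuestionResult) -> float:
    """Scale the pass fraction to 0-10 (assumed normalization)."""
    if result.total <= 0:
        raise ValueError("a question needs at least one test case")
    return 10.0 * result.passed / result.total


def run_score(results: list[QuestionResult]) -> float:
    """Total score of a single run: sum of the per-question scores."""
    return sum(question_score(r) for r in results)


def best_of_runs(runs: list[list[QuestionResult]]) -> float:
    """Keep the highest run total (assumed reading of 'highest score recorded')."""
    if not runs:
        raise ValueError("need at least one run")
    return max(run_score(run) for run in runs)
```

With the 28 questions mentioned above, a run total under these assumptions would fall between 0 and 280. Taking the maximum over runs favors a model's best behavior and hides run-to-run variance; a mean or median would be the stricter alternative.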