llm2014/llm_benchmark

A personal benchmarking system that tracks and evaluates large language models using a rolling private question bank.

1.2k stars LLMOps · Eval
Velocity (7d): +2.5 ★/day · Trend: steady

This repository maintains a systematic evaluation framework for LLMs, testing models on logic, mathematics, programming, and reasoning tasks. It uses a private question bank of 28 questions with 270 test cases, updated monthly, and evaluates models through official APIs or proxy services. Each model is tested three times and the highest score is recorded; scores are normalized to a 0-10 scale per question.
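The scoring rule described above can be sketched in a few lines of Python. This is only an illustration, not the repository's code: the function names are hypothetical, the normalization is assumed to be the fraction of test cases passed scaled to 10, and "highest score" is assumed to mean the run with the highest total across questions. The repository may define either differently.

```python
from typing import Dict, List


def question_score(passed: int, total: int) -> float:
    """Normalize one question to a 0-10 scale.

    Assumption: the score is the fraction of test cases passed, times 10.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return 10.0 * passed / total


def run_total(run: Dict[str, float]) -> float:
    """Sum the per-question scores of a single run."""
    return sum(run.values())


def best_run(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Return the run with the highest total score.

    Assumption: the best of the three runs is picked by overall total,
    not question by question.
    """
    if not runs:
        raise ValueError("at least one run is required")
    return max(runs, key=run_total)


# Three hypothetical runs of one model over two questions.
runs = [
    {"q1": question_score(2, 4), "q2": question_score(4, 4)},
    {"q1": question_score(3, 4), "q2": question_score(4, 4)},
    {"q1": question_score(1, 4), "q2": question_score(2, 4)},
]
recorded = best_run(runs)
print(recorded, run_total(recorded))
```

With 28 questions each capped at 10, the maximum total under these assumptions would be 280.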
