
OpenGenerativeAI/llm-colosseum

A gaming-based benchmark that pits LLMs against each other in Street Fighter III to evaluate decision-making speed, strategy, and adaptability in real time.

1.5k stars · Jupyter Notebook · LLMOps · Eval
Velocity (7d): +1.8 ★ / day · Trend: steady

The platform runs LLM-vs-LLM matches in Street Fighter III, measuring how well each model responds to the game state, makes rapid decisions, and adapts its strategy over a full match. Each model earns an Elo rating based on fight outcomes. The benchmark tracks criteria such as decision speed, strategic depth, and resilience across hundreds of fights.
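
To illustrate how a rating can be derived from fight outcomes, here is a minimal sketch of the standard Elo update for a single match. It is not taken from the llm-colosseum code: the function names `expected_score` and `update_elo`, the K-factor of 32, and the 1500 starting rating are assumptions chosen for illustration, and the project may use different constants or a different variant.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one fight.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw.
    k (assumed value 32) controls how far one result moves a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    score_b = 1.0 - score_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * (score_b - exp_b)
    return new_a, new_b


# Two hypothetical models start level at 1500; model A wins the fight.
model_a, model_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(model_a, model_b)
```

Because the two updates are symmetric, the points one model gains equal the points the other loses, so the rating pool stays constant over hundreds of fights.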
