Is evalscope open source?

Yes — modelscope/evalscope is open source, released under the Apache-2.0 license.

What language is evalscope written in?

modelscope/evalscope is primarily written in Python.

How popular is evalscope?

modelscope/evalscope has 3.1k stars on GitHub. Its star growth is currently cooling off, at about +5.0 stars per day over the last 7 days.

Where can I find evalscope?

modelscope/evalscope is on GitHub at https://github.com/modelscope/evalscope.

modelscope/evalscope

One Pipeline to Benchmark LLMs, Agents, and Inference Speed

EvalScope bundles industry benchmarks, inference stress tests, and agent trajectory replay into a single evaluation workbench.

★ 3.1k stars · Python · LLMOps · Eval

Velocity (7d): +5.0 ★ / day · Trend: ↘ cooling

What it does

EvalScope is a Python evaluation framework from the ModelScope Community that wraps capability benchmarks, inference stress tests, and result visualization into a single workbench. It supports LLMs, vision-language models, embeddings, rerankers, and AIGC evaluators, and integrates existing backends like OpenCompass and VLMEvalKit rather than reinventing them. You can run standard suites such as MMLU or GSM8K, measure TTFT and TPOT under load, and browse results in a React-based WebUI.
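For readers unfamiliar with the two latency metrics named above, here is a minimal sketch of how TTFT (time to first token) and TPOT (time per output token) are commonly derived from the timestamps of a streamed response. The `StreamTiming` record and both function names are hypothetical and are not part of EvalScope's API; its own implementation may differ in detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    """Timestamps in seconds for one streamed response.

    Hypothetical record for illustration; not an EvalScope type.
    """
    request_sent: float
    token_times: List[float]  # arrival time of each output token, ascending


def ttft(t: StreamTiming) -> float:
    """Time to first token: delay between sending the request and the first token."""
    if not t.token_times:
        raise ValueError("response produced no tokens")
    return t.token_times[0] - t.request_sent


def tpot(t: StreamTiming) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    n = len(t.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap to average
    return (t.token_times[-1] - t.token_times[0]) / (n - 1)


timing = StreamTiming(request_sent=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(f"TTFT={ttft(timing):.2f}s TPOT={tpot(timing):.2f}s")
```

Under load, a harness would typically collect one such record per request and report percentiles rather than a single value.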

The interesting bit

Where it diverges from a typical benchmark harness is its focus on agentic evaluation: it can drive multi-turn tool-use loops inside Docker sandboxes, replay real agent traces for performance testing, and even intercept traffic from off-the-shelf CLIs like Claude Code or OpenAI Codex via an External Agent Bridge to benchmark them against your own model. It also records latency metrics during the same accuracy run, so you can see whether a high score comes at the cost of slow responses.

Key highlights

Bundles dozens of built-in benchmarks—MMLU, C-Eval, GSM8K, SWE-bench_Pro, GAIA, TIR-Bench, and more—spanning text, code, vision, audio, and RAG.
Agent Evaluation Mode runs pluggable strategies (function_calling, react, swe_bench_*) with tool calls and per-sample trace recording.
External Agent Bridge transparently forwards LLM traffic from commercial agent CLIs to your evaluation endpoint and records the full trajectory.
Vendor Verifier benchmarks check whether third-party API deployments faithfully reproduce official model behavior.
Combines accuracy and efficiency metrics in one pass, tracking TTFT, TPOT, and throughput alongside task scores.
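The idea of collecting accuracy and efficiency metrics in one pass can be sketched as a single loop that times each model call while scoring its answer. This is an illustration only: `evaluate`, `toy_model`, and the exact-match scoring rule are assumptions made for the example, not EvalScope's actual interface or metrics.

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Tuple


def evaluate(model: Callable[[str], str],
             samples: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score exact-match accuracy and wall-clock latency in the same loop."""
    correct = 0
    latencies = []
    for prompt, expected in samples:
        start = time.perf_counter()
        answer = model(prompt)
        latencies.append(time.perf_counter() - start)  # latency of this call
        correct += int(answer.strip() == expected)     # exact-match scoring
    return {
        "accuracy": correct / len(samples),
        "mean_latency_s": mean(latencies),
    }


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; it only knows one answer."""
    return "4" if prompt == "2+2?" else "unknown"


report = evaluate(toy_model, [("2+2?", "4"), ("3+3?", "6")])
print(report)
```

Because both numbers come from the same calls, a score and its latency always describe the same run, which is the property the last highlight describes.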

Verdict

Teams running internal model comparisons or hosting their own inference should look here; it is especially useful if you are testing agent pipelines rather than just chat completions. If you only need a quick script to score a single CSV dataset, it is likely overkill.
