eval-sys/mcpmark

A benchmark suite that stress-tests AI agents by executing one-command tasks across real MCP tool environments with isolated sandboxes.

423 stars · Python · LLMOps · EvalAgents
Velocity (7d): +1.2 ★/day · Trend: steady

MCPMark provides a reproducible evaluation framework for measuring AI agent performance in real-world MCP tool-use scenarios. The benchmark runs agents against tasks in environments including Notion, GitHub, Filesystem, Postgres, and Playwright, with isolated sandbox execution and automatic recovery from failures. It generates unified metrics and aggregated reports for comparing model capabilities, and supports trajectory logging to Hugging Face datasets.
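
To make the idea of unified metrics across environments concrete, here is a minimal sketch of how per-task outcomes could be rolled up into per-environment and overall pass rates. It is an illustration only and does not use MCPMark's code or API: the `TaskResult` record, its fields, and the `aggregate_pass_rates` function are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of one agent run on one task (not MCPMark's schema)."""
    environment: str  # e.g. "notion", "github", "filesystem", "postgres", "playwright"
    task_id: str
    passed: bool


def aggregate_pass_rates(results):
    """Return {environment: pass_rate}, plus an "overall" entry when results exist."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        totals[result.environment] += 1
        passes[result.environment] += int(result.passed)

    # Pass rate per environment: passed tasks divided by attempted tasks.
    report = {env: passes[env] / totals[env] for env in totals}
    if totals:
        # Overall rate is computed over all tasks, not averaged over environments.
        report["overall"] = sum(passes.values()) / sum(totals.values())
    return report
```

Computing the overall rate over all tasks, rather than averaging the per-environment rates, weights each task equally; an actual benchmark may reasonably choose a different convention.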
