eval-sys/mcpmark
A benchmark suite that stress-tests AI agents on tasks in real MCP tool environments, with one-command execution and isolated sandboxes.

MCPMark provides a reproducible evaluation framework for measuring AI agent performance in real-world MCP tool-use scenarios. The benchmark runs agents against tasks in environments including Notion, GitHub, Filesystem, Postgres, and Playwright, with each task executed in an isolated sandbox and automatic recovery from failures. It produces unified metrics and aggregated reports for comparing model capabilities, and supports logging agent trajectories to Hugging Face datasets.
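To make the idea of unified metrics concrete, here is a minimal sketch of how per-task outcomes could be rolled up into a per-model, per-environment success rate. The `TaskResult` record, the `aggregate` function, and the model and task names are all hypothetical and chosen for illustration; they are not MCPMark's actual schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record shape, not MCPMark's real result format.
    model: str
    environment: str
    task_id: str
    success: bool


def aggregate(results):
    """Return {model: {environment: success_rate}} from per-task results."""
    # counts[model][environment] holds [passed, total].
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in results:
        cell = counts[r.model][r.environment]
        cell[0] += int(r.success)
        cell[1] += 1
    return {
        model: {env: passed / total for env, (passed, total) in envs.items()}
        for model, envs in counts.items()
    }


# Made-up outcomes for two placeholder models.
results = [
    TaskResult("model-a", "filesystem", "t1", True),
    TaskResult("model-a", "filesystem", "t2", False),
    TaskResult("model-a", "postgres", "t1", True),
    TaskResult("model-b", "filesystem", "t1", True),
]
report = aggregate(results)
for model, envs in report.items():
    print(model, envs)
```

Keying the report by model first and environment second keeps side-by-side model comparison simple; a real report would likely also track retries, cost, and turn counts.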