InternLM/WildClawBench
An in-the-wild benchmark suite that evaluates AI agents by dropping them into a live OpenClaw environment and testing them on 60 hard, practical tasks.

WildClawBench is an end-to-end evaluation framework for AI agents that runs inside the OpenClaw personal assistant environment. Its 60 original tasks cover real-world scenarios such as extracting highlights from videos, negotiating over email, finding contradictions in search results, writing inference scripts, and catching privacy leaks. The benchmark ships with multiple evaluation harnesses and tracks results on a public leaderboard. It is designed to be hard enough that even the strongest frontier models reach only about 62% accuracy, so scores still differentiate between models.