sierra-research/tau2-bench
τ-Bench is a benchmark for evaluating how AI agents perform at tool calling and multi-turn user interaction in real-world domains such as banking and retail.

τ-Bench assesses how well AI agents handle tool use and user conversations in realistic service scenarios. It provides standardized test cases across domains (banking, retail, airline) in which an agent must retrieve information, execute actions, and maintain a coherent multi-turn dialogue. Automated evaluation metrics score task completion accuracy, efficiency, and conversation quality.
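
As a rough illustration of the kind of aggregation such metrics involve, the sketch below computes a per-domain task completion rate and an average conversation length from a list of trial records. Everything in it is an assumption made for illustration: the `TrialResult` record, its fields, and the `summarize` helper are hypothetical and are not the benchmark's actual schema or API. Using turn count as an efficiency proxy is also an assumption, and conversation quality is not scored here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialResult:
    """One simulated conversation (hypothetical record, not the benchmark's schema)."""
    domain: str      # e.g. "retail"
    task_id: str     # identifier of the test case
    success: bool    # whether the agent completed the task
    num_turns: int   # conversation length, used here as a rough efficiency proxy


def summarize(results):
    """Return {domain: (completion_rate, mean_turns)} for a list of TrialResult."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.domain].append(r)

    summary = {}
    for domain, rows in grouped.items():
        rate = sum(r.success for r in rows) / len(rows)        # fraction of successful trials
        mean_turns = sum(r.num_turns for r in rows) / len(rows)  # average dialogue length
        summary[domain] = (rate, mean_turns)
    return summary


# Made-up sample data, for demonstration only.
results = [
    TrialResult("retail", "r1", True, 6),
    TrialResult("retail", "r2", False, 10),
    TrialResult("airline", "a1", True, 8),
]
summary = summarize(results)
```

Grouping by domain before averaging keeps a domain with many test cases from dominating the overall picture.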