google-research/rliable
Python library for statistically rigorous evaluation on reinforcement learning and machine learning benchmarks, using bootstrap confidence intervals and robust aggregate metrics.

rliable provides tools for reliable evaluation on RL and ML benchmarks, even when only a handful of runs per task are available. It implements stratified bootstrap confidence intervals to quantify uncertainty in aggregate performance, and offers alternative aggregate metrics such as the Interquartile Mean (IQM), which is more robust to outliers than the mean and more statistically efficient than the median. The library also supports performance profiles, which plot the fraction of runs scoring above each threshold and so show how scores are distributed across tasks and runs.
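To make the two core ideas concrete, here is a minimal standard-library sketch of an interquartile mean and a stratified percentile bootstrap. It does not call rliable and is not the library's implementation: the function names `interquartile_mean` and `stratified_bootstrap_ci`, the runs-by-tasks list layout, the toy scores, and the choice of a plain percentile interval are all assumptions made for illustration.

```python
import random
from statistics import fmean


def interquartile_mean(scores):
    """Mean of the middle 50% of all scores, pooled across runs and tasks."""
    flat = sorted(s for run in scores for s in run)
    cut = int(0.25 * len(flat))  # number of values trimmed from each end
    return fmean(flat[cut:len(flat) - cut])


def stratified_bootstrap_ci(scores, statistic, reps=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for `statistic`.

    `scores` is a list of runs, each run a list of per-task scores.
    Tasks are the strata: runs are resampled with replacement
    independently within each task. (Hypothetical helper, not rliable's API.)
    """
    rng = random.Random(seed)
    num_runs, num_tasks = len(scores), len(scores[0])
    estimates = []
    for _ in range(reps):
        # For every task, draw num_runs run indices with replacement.
        picks = [[rng.randrange(num_runs) for _ in range(num_runs)]
                 for _ in range(num_tasks)]
        resampled = [[scores[picks[t][r]][t] for t in range(num_tasks)]
                     for r in range(num_runs)]
        estimates.append(statistic(resampled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * reps)]
    upper = estimates[int((1 - alpha / 2) * reps) - 1]
    return lower, upper


# Toy data (made up): 3 runs x 4 tasks, with one outlier score of 9.0.
scores = [
    [0.9, 0.2, 0.5, 0.7],
    [1.1, 0.3, 0.4, 0.6],
    [0.8, 0.1, 0.6, 9.0],
]

pooled_mean = fmean(s for run in scores for s in run)  # pulled up by the outlier
iqm = interquartile_mean(scores)                       # about 0.6
lower, upper = stratified_bootstrap_ci(scores, interquartile_mean, reps=500)
print(f"mean={pooled_mean:.3f}  IQM={iqm:.3f}  95% CI=[{lower:.3f}, {upper:.3f}]")
```

On this toy data the single outlier lifts the pooled mean well above the IQM, which is the kind of sensitivity the IQM is meant to reduce. Resampling within each task, rather than pooling all scores, keeps every task represented in each bootstrap replicate.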