
google-research/rliable

Python library for statistically rigorous evaluation of reinforcement learning and machine learning benchmarks using bootstrap confidence intervals and aggregate metrics.

870 stars · Jupyter Notebook · LLMOps · Eval
Velocity (7d): +0.5 ★/day · Trend: steady

rliable provides tools for reliable evaluation on RL and ML benchmarks when only a few runs per task are available. It implements stratified bootstrap confidence intervals to quantify uncertainty in aggregate performance, and offers alternative aggregate metrics such as the interquartile mean (IQM), which is more robust to outliers than the mean and more statistically efficient than the median. The library also supports performance profiles, which visualize the distribution of scores across tasks and runs.
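To make the two core ideas concrete, here is a minimal standard-library sketch of an interquartile mean and a stratified percentile bootstrap. It is not rliable's own implementation or API: the function names `iqm` and `stratified_bootstrap_ci`, the task names, and the scores are all hypothetical, and the trimming rule (drop `int(0.25 * n)` values from each end) is one common convention assumed here.

```python
import random


def iqm(values):
    """Interquartile mean: average of the middle ~50% of the values."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))  # values trimmed from each end
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def stratified_bootstrap_ci(scores, aggregate, reps=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for an aggregate metric.

    `scores` maps a task name to a list of per-run scores. Runs are
    resampled with replacement within each task (the stratum), so every
    bootstrap replicate keeps the same number of runs per task.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        resampled = []
        for runs in scores.values():
            resampled.extend(rng.choices(runs, k=len(runs)))
        estimates.append(aggregate(resampled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * reps)]
    upper = estimates[min(reps - 1, int((1 - alpha / 2) * reps))]
    return lower, upper


# Hypothetical normalized scores: 3 tasks x 5 runs, one outlier run in task_c.
scores = {
    "task_a": [0.9, 1.0, 1.1, 1.0, 0.95],
    "task_b": [0.4, 0.5, 0.45, 0.55, 0.5],
    "task_c": [1.2, 1.1, 1.3, 9.0, 1.25],
}

all_scores = [s for runs in scores.values() for s in runs]
mean = sum(all_scores) / len(all_scores)
point = iqm(all_scores)
low, high = stratified_bootstrap_ci(scores, iqm)

print(f"mean = {mean:.3f}")
print(f"IQM  = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

In this toy data the single outlier run pulls the mean well above the IQM, which is the kind of sensitivity the IQM is meant to reduce. For real evaluations, the library's own functions should be preferred over this sketch.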
