Minqi824/ADBench

98,436 experiments say your favorite anomaly detector probably isn't the best

A NeurIPS 2022 benchmark that runs 30 algorithms across 57 datasets to test which anomaly detection methods actually work, and under what conditions they fall apart.

1k stars Python LLMOps · EvalData Tooling

What it does

ADBench is a systematic testbed for tabular anomaly detection. It runs 30 algorithms (14 unsupervised, 7 semi-supervised, 9 supervised) through 98,436 experiments on 57 datasets and compares them from three angles: how much supervision helps, how different anomaly types trip up models, and how algorithms cope with noisy or corrupted data. You can install it with pip install adbench, download the datasets from the project's remote repo, and benchmark your own method with a few lines of Python.

The interesting bit

The big surprise: no single unsupervised algorithm statistically dominates the others. Even weirder, with just 1% of anomalies labeled, semi-supervised methods often beat the best unsupervised approach. Yet in controlled settings, the right unsupervised method for a specific anomaly type can outperform fully supervised ones. The lesson is less “use X” and more “it depends, so test it.”
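To make the "1% labeled anomalies" setting concrete, here is a minimal sketch of how such a training set could be simulated. It assumes binary labels (1 = anomaly, 0 = normal). The helper name `mask_labels` and its rounding rule are hypothetical, not ADBench's own code.

```python
import random


def mask_labels(y, labeled_anomaly_ratio=0.01, seed=0):
    """Keep the label for only a fraction of the anomalies.

    Hypothetical helper for illustration: every other point, including
    the anomalies that were not sampled, is treated as unlabeled (0).
    """
    rng = random.Random(seed)
    anomaly_idx = [i for i, label in enumerate(y) if label == 1]
    if not anomaly_idx:
        return [0] * len(y)
    # Assumption: keep at least one labeled anomaly so the setting is not empty.
    n_keep = max(1, round(labeled_anomaly_ratio * len(anomaly_idx)))
    kept = set(rng.sample(anomaly_idx, n_keep))
    return [1 if i in kept else 0 for i in range(len(y))]


# Toy data: 200 anomalies among 2,000 points.
y_true = [1] * 200 + [0] * 1800
y_partial = mask_labels(y_true, labeled_anomaly_ratio=0.01)
print(sum(y_true), "anomalies in total,", sum(y_partial), "visible to the model")
```

A semi-supervised detector would then train on `y_partial` and be scored against `y_true`, which is why a small number of labels can matter so much.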

Key highlights

  • 57 datasets in unified .npz format, including 10 new CV/NLP-derived sets with pretrained embeddings
  • Three experimental angles: supervision level, anomaly type (local/global/dependency/cluster), and data corruption (duplicated anomalies, irrelevant features, label contamination)
  • RunPipeline class runs the selected algorithm group (unsupervised, semi-supervised, or supervised) and auto-exports results to CSV
  • Custom algorithm support: drop your model into the Customized baseline template
  • Maintained by the PyOD/TODS/PyGOD authors, so the ecosystem integration is real
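As a rough illustration of what "dropping in" a custom model involves, the sketch below defines a toy detector and scores it with AUC-ROC. It assumes the benchmark expects a `fit` method and a `predict_score` method that returns one anomaly score per row, so check the project's Customized template for the exact interface. The class `MeanDistanceDetector`, the helper `roc_auc`, and the data are all made up for this example, and the code uses only the standard library rather than ADBench itself.

```python
import math


class MeanDistanceDetector:
    """Toy unsupervised detector: score = Euclidean distance to the training mean."""

    def fit(self, X_train, y_train=None):
        # y_train is accepted but ignored, since this detector is unsupervised.
        n = len(X_train)
        dims = len(X_train[0])
        self.center_ = [sum(row[j] for row in X_train) / n for j in range(dims)]
        return self

    def predict_score(self, X):
        # Higher score means more anomalous.
        return [math.dist(row, self.center_) for row in X]


def roc_auc(y_true, scores):
    """AUC-ROC as the chance that a random anomaly outscores a random normal point.

    Ties count as half a win. This pairwise form is quadratic, fine for a demo.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Four normal points near the origin and one obvious global anomaly.
X = [[0.0, 0.1], [0.1, -0.1], [-0.1, 0.0], [0.05, 0.05], [5.0, 5.0]]
y = [0, 0, 0, 0, 1]

model = MeanDistanceDetector().fit(X)
auc = roc_auc(y, model.predict_score(X))
print("AUC-ROC:", auc)
```

A distance-to-mean detector like this does well on global anomalies and poorly on local ones, which is the kind of dependence on anomaly type the benchmark is built to expose.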

Caveats

  • Datasets must be downloaded separately from the GitHub repo (or jihulab for mainland China users); not bundled in the pip package
  • The README’s dependency list is commented out, so you’ll need to check guidance.ipynb or source code for actual requirements
  • Multi-class datasets like CIFAR10 require special naming conventions (number_data_class.npz)

Verdict

Worth your time if you’re publishing anomaly detection research or choosing a production model and don’t want to rely on folklore. Skip it if you already know your data is clean, your anomalies are purely local, and you’ve sworn a blood oath to a single algorithm.
