Pandas meets scikit-learn without the duct tape
A library that finally treats messy dataframes as first-class citizens in ML pipelines.

What it does
skrub bridges the awkward gap between raw, messy dataframes and scikit-learn’s tidy numerical world. It provides transformers and tools that handle dirty categorical data, text, and other real-world column types without forcing you into a preprocessing rabbit hole.
The interesting bit
The project evolved from dirty_cat, a focused tool for encoding messy categories, into something broader: making entire dataframes ML-ready. The name change signals ambition beyond just cleaning up strings.
Key highlights
- Designed around pandas-like dataframes from the start, rather than supporting them as an afterthought
- Handles “dirty” categorical data (typos, inconsistencies, rare categories) that standard encoders choke on
- Integrates with scikit-learn pipelines without custom glue code
- Active community with Discord, learning materials, and example galleries
- 1,618 GitHub stars at the time of writing, with steady development under the skrub-data org
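
To make the "dirty categories" highlight concrete, here is a minimal, standard-library-only Python sketch of one general idea behind typo-tolerant encoding: comparing strings by the character n-grams they share. It is an illustration under stated assumptions, not skrub's code or API. The function names (`char_ngrams`, `ngram_similarity`, `similarity_encode`) and the sample column are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.strip().lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to yield n-grams count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as a row of similarities, one per prototype category."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical dirty column: two employers, spelled inconsistently.
values = ["Police Dept", "police dept.", "Polce Dept", "Fire Rescue"]
prototypes = ["Police Dept", "Fire Rescue"]

encoded = similarity_encode(values, prototypes)
for value, row in zip(values, encoded):
    print(f"{value!r:16} {[round(x, 2) for x in row]}")
```

A one-hot encoder would give the three police spellings three unrelated columns. Here they all score closest to the same prototype, so a downstream model sees the variants as near-duplicates rather than as distinct categories.
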
Caveats
- The README is thin on specifics; you’ll need to dig into the website and examples to understand actual capabilities
- The project was formerly named dirty_cat, so some documentation and Stack Overflow answers may still reference the old name
Verdict
Worth a look if you spend more time wrestling data into shape than training models. Skip it if your data already arrives as clean numerical matrices or you live entirely in deep-learning frameworks.