skrub-data/skrub

Pandas meets scikit-learn without the duct tape

A library that finally treats messy dataframes as first-class citizens in ML pipelines.

1.6k stars · Python · Data Tooling
Velocity (7d): +0.5 ★/day · Trend: steady

What it does

skrub bridges the awkward gap between raw, messy dataframes and scikit-learn’s tidy numerical world. It provides transformers and tools that handle dirty categorical data, text, and other real-world column types without forcing you into a preprocessing rabbit hole.

The interesting bit

The project evolved from dirty_cat, a focused tool for encoding messy categories, into something broader: making entire dataframes ML-ready. The name change signals ambition beyond just cleaning up strings.

Key highlights

  • Built specifically for pandas-like dataframes, not as an afterthought
  • Handles “dirty” categorical data (typos, inconsistencies, rare categories) that standard encoders choke on
  • Integrates with scikit-learn pipelines without custom glue code
  • Active community with Discord, learning materials, and example galleries
  • 1,618 stars and steady development under the skrub-data org
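
Why typos trip up standard encoders: a one-hot encoder treats "London" and "Lodnon" as unrelated categories. One common remedy is to compare strings by their shared character n-grams. The sketch below illustrates that general idea in plain Python; it is not skrub's implementation, and the helper names (`char_ngrams`, `ngram_similarity`, `similarity_encode`) are hypothetical.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the n-gram sets of two strings, in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values: list[str], vocabulary: list[str]) -> list[list[float]]:
    """Encode each value as its similarity to every reference category."""
    return [[ngram_similarity(v, ref) for ref in vocabulary] for v in values]


# A misspelling keeps some overlap with the intended category,
# while an unrelated category shares no trigrams with it.
encoded = similarity_encode(["London", "Lodnon", "Paris"], ["London", "Paris"])
```

Each row of `encoded` is a small numeric vector rather than a single on/off flag, so near-duplicates end up close to each other instead of in separate columns.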

Caveats

  • The README is thin on specifics; you’ll need to dig into the website and examples to understand actual capabilities
  • Formerly named dirty_cat, so some documentation and Stack Overflow answers may still reference the old name

Verdict

Worth a look if you spend more time wrestling data into shape than training models. Skip it if your data is already clean numerical matrices or you live entirely in deep-learning frameworks.
