rhiever/datacleaner

A janitor for pandas: fill, encode, and move on

datacleaner automates the three chores every data scientist repeats before the real work starts.

1.1k stars · Python · Data Tooling · Velocity (7d): +0.3 ★/day · Trend: steady

What it does

datacleaner takes a pandas DataFrame and applies a fixed, opinionated cleaning pipeline: optionally drop rows with missing values, fill the remaining NaNs column by column (mode for categorical columns, median for continuous ones), and encode non-numeric columns such as categorical strings as numbers. It exposes a Python API (autoclean and autoclean_cv) and a CLI for quick one-off jobs. The output is a plain DataFrame, so every pandas operation remains available afterward.
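To make the three steps concrete, here is a minimal sketch in plain pandas. It is not datacleaner's code: sketch_clean is a hypothetical stand-in, and the sorted-label integer encoding is an assumption chosen to keep the example dependency-free.

```python
import pandas as pd


def sketch_clean(df: pd.DataFrame, drop_nans: bool = False) -> pd.DataFrame:
    """Hypothetical stand-in for the pipeline described above (not datacleaner's code)."""
    # Step 1 (optional): drop every row that has a missing value.
    out = (df.dropna() if drop_nans else df).copy()

    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Step 2a: numeric columns are filled with the column median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Step 2b: non-numeric columns are filled with the column mode.
            modes = out[col].mode()
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
            # Step 3: map each distinct label to an integer (assumed scheme:
            # labels sorted alphabetically, then numbered from 0).
            labels = sorted(out[col].dropna().unique())
            out[col] = out[col].map({label: i for i, label in enumerate(labels)})
    return out


raw = pd.DataFrame(
    {
        "age": [20.0, None, 40.0, 30.0],
        "city": ["Oslo", None, "Oslo", "Bergen"],
    }
)
cleaned = sketch_clean(raw)
# age:  the missing value becomes the median of 20, 40, 30, which is 30.0
# city: the missing value becomes the mode "Oslo"; Bergen -> 0, Oslo -> 1
```

The result is an ordinary DataFrame with no NaNs and only numeric columns, which is the shape of output the description above promises.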

The interesting bit

The autoclean_cv variant is the quiet win: it learns imputation and encoding parameters from the training set only, then applies them to both splits. That prevents the classic cross-validation leak where your test set whispers its medians to the model ahead of time. For a tool this small, that level of care is notable.
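The fit-on-train, apply-to-both idea can be sketched as follows. This illustrates the technique only, not autoclean_cv's internals: fit_cleaning_params and apply_cleaning_params are hypothetical helper names, and the integer encoding scheme is an assumption.

```python
import pandas as pd


def fit_cleaning_params(train: pd.DataFrame) -> dict:
    """Learn fill values and label codes from the training split only."""
    params = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            params[col] = {"fill": train[col].median(), "codes": None}
        else:
            modes = train[col].mode()
            fill = modes.iloc[0] if not modes.empty else None
            labels = sorted(train[col].dropna().unique())
            params[col] = {"fill": fill, "codes": {v: i for i, v in enumerate(labels)}}
    return params


def apply_cleaning_params(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Apply previously learned parameters; nothing is re-estimated from df."""
    out = df.copy()
    for col, p in params.items():
        if pd.notna(p["fill"]):
            out[col] = out[col].fillna(p["fill"])
        if p["codes"] is not None:
            # Labels never seen in training have no code and become NaN here.
            out[col] = out[col].map(p["codes"])
    return out


train = pd.DataFrame({"income": [10.0, 20.0, 30.0], "plan": ["a", "b", "b"]})
test = pd.DataFrame({"income": [None, 1000.0], "plan": [None, "a"]})

params = fit_cleaning_params(train)          # the test split is never inspected
clean_train = apply_cleaning_params(train, params)
clean_test = apply_cleaning_params(test, params)
# The missing test income is filled with the training median (20.0),
# not with anything derived from the test value 1000.0.
```

Because the parameters come from the training split alone, an extreme test value such as 1000.0 cannot shift the fill value used for either split.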

Key highlights

  • CLI and Python API both supported; pip-installable
  • autoclean_cv explicitly guards against information leakage between train and test sets
  • Accepts a custom category_encoders transformer in place of the default LabelEncoder
  • Works directly on pandas DataFrames without wrapping or hiding them
  • MIT licensed and citable via Zenodo DOI

Caveats

  • The README is upfront that this “is not magic”: it will not parse unstructured text or rescue malformed files
  • Feature set is deliberately narrow; the authors note they “plan to add more cleaning features as the project grows,” so check whether current capabilities match your needs
  • Default input/output separator is tab (\t), so comma-separated files need the separators set explicitly
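The separator caveat is easy to reproduce with pandas alone. The sketch below assumes the tool's tab default behaves like pandas.read_csv with sep="\t"; it does not call datacleaner itself, and the sample data is made up for illustration.

```python
import io

import pandas as pd

# A small comma-separated file held in memory (illustrative data).
csv_text = "age,city\n20,Oslo\n40,Bergen\n"

# Parsed with a tab separator, each line is treated as one field.
as_tab = pd.read_csv(io.StringIO(csv_text), sep="\t")

# Parsed with the matching separator, the two columns come apart correctly.
as_comma = pd.read_csv(io.StringIO(csv_text), sep=",")

print(list(as_tab.columns))    # ['age,city']
print(list(as_comma.columns))  # ['age', 'city']
```

No error is raised in the tab case; the file simply loads as a single string column, so the mistake tends to surface later as unexpected encoding results rather than as a parse failure.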

Verdict

Worth a look if you train enough models to be bored of writing the same preprocessing scaffolding. Skip it if you need heavy-duty parsing, outlier detection, or feature engineering — this is strictly the first five minutes of a notebook, automated.
