A janitor for pandas: fill, encode, and move on
datacleaner automates the three chores every data scientist repeats before the real work starts.

What it does
datacleaner takes a pandas DataFrame and applies a fixed, opinionated cleaning pipeline: optionally drop rows with missing values, fill remaining NaNs column by column (mode for categorical columns, median for continuous ones), and encode non-numeric columns such as categorical strings as numbers. It exposes both a Python API (autoclean, autoclean_cv) and a CLI for quick one-off jobs. The output is still a plain DataFrame, so you keep full access to pandas operations afterward.
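To make the pipeline concrete, here is a minimal sketch of the same three steps in plain pandas. It is not datacleaner's own code: the function name sketch_autoclean is hypothetical, the integer codes come from pandas categories rather than from the library's encoder, and the sketch assumes every column has at least one non-missing value.

```python
import pandas as pd


def sketch_autoclean(df: pd.DataFrame, drop_nans: bool = False) -> pd.DataFrame:
    """Hypothetical stand-in for the pipeline described above (not datacleaner's code)."""
    # Step 1 (optional): drop rows with any missing value. Work on a copy either way.
    out = (df.dropna() if drop_nans else df).copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Step 2a: continuous column, fill gaps with the column median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Step 2b: categorical column, fill gaps with the most frequent value.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
            # Step 3: map each distinct string to an integer code.
            out[col] = out[col].astype("category").cat.codes
    return out


raw = pd.DataFrame({
    "age": [22.0, None, 35.0, 41.0],
    "city": ["Oslo", "Lima", None, "Oslo"],
})
cleaned = sketch_autoclean(raw)
print(cleaned)
```

The missing age becomes the median of the observed ages, the missing city becomes the most common city, and the input frame is left untouched.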
The interesting bit
The autoclean_cv variant is the quiet win: it learns imputation and encoding parameters from the training set only, then applies them to both splits. That prevents the classic cross-validation leak where your test set whispers its medians to the model ahead of time. For a tool this small, that level of care is notable.
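The idea of fitting on the training split only can be sketched in plain pandas as follows. This is an illustration of the technique, not datacleaner's implementation: sketch_autoclean_cv and the toy column names are hypothetical, and categories that appear only in the test split are simply left as NaN here.

```python
import pandas as pd


def sketch_autoclean_cv(train: pd.DataFrame, test: pd.DataFrame):
    """Hypothetical illustration of train-only fitting (not datacleaner's code)."""
    train_out, test_out = train.copy(), test.copy()
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            # The fill value is learned from the training split only.
            fill = train[col].median()
            train_out[col] = train[col].fillna(fill)
            test_out[col] = test[col].fillna(fill)
        else:
            # Mode and integer codes are also learned from training only.
            fill = train[col].mode().iloc[0]
            seen = sorted(train[col].dropna().unique())
            codes = {value: i for i, value in enumerate(seen)}
            train_out[col] = train[col].fillna(fill).map(codes)
            # Values never seen in training map to NaN in this sketch.
            test_out[col] = test[col].fillna(fill).map(codes)
    return train_out, test_out


train = pd.DataFrame({"income": [10.0, 20.0, None], "plan": ["a", "b", "a"]})
test = pd.DataFrame({"income": [None, 1000.0], "plan": [None, "b"]})
train_clean, test_clean = sketch_autoclean_cv(train, test)
print(test_clean)
```

The missing test income is filled with the training median, not with anything computed from the test rows, so the large test value never influences the imputation.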
Key highlights
- CLI and Python API both supported; pip-installable
- autoclean_cv explicitly guards against information leakage between train and test sets
- Accepts custom category_encoders transformers instead of defaulting to LabelEncoder
- Works directly on pandas DataFrames without wrapping or hiding them
- MIT licensed and citable via Zenodo DOI
Caveats
- The README is upfront that this “is not magic”: it will not parse unstructured text or rescue malformed files
- Feature set is deliberately narrow; the authors note they “plan to add more cleaning features as the project grows,” so check whether current capabilities match your needs
- Default input/output separator is tab (\t), which has tripped more than one CSV user
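The separator caveat is easy to reproduce with pandas alone. The snippet below does not call datacleaner; it only shows what a tab-separated parse does to comma-separated text, which is roughly what to expect if a CSV is read with a tab default. The sample data is made up for illustration.

```python
import io

import pandas as pd

csv_text = "age,city\n22,Oslo\n35,Lima\n"

# Parsed as tab-separated, each comma-delimited line collapses into one column.
as_tabs = pd.read_csv(io.StringIO(csv_text), sep="\t")

# Parsed with the matching separator, both columns come back.
as_commas = pd.read_csv(io.StringIO(csv_text), sep=",")

print(as_tabs.shape)
print(as_commas.shape)
```

If a cleaned file comes out with a single wide column, a separator mismatch is the first thing to check; the CLI help lists the options for overriding the input and output separators.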
Verdict
Worth a look if you train enough models to be bored of writing the same preprocessing scaffolding. Skip it if you need heavy-duty parsing, outlier detection, or feature engineering — this is strictly the first five minutes of a notebook, automated.