Is datacleaner open source?

Yes — rhiever/datacleaner is open source, released under the MIT license.

What language is datacleaner written in?

rhiever/datacleaner is primarily written in Python.

How popular is datacleaner?

rhiever/datacleaner has 1.1k stars on GitHub.

Where can I find datacleaner?

rhiever/datacleaner is on GitHub at https://github.com/rhiever/datacleaner.

rhiever/datacleaner

A janitor for pandas: fill, encode, and move on

datacleaner automates the three chores every data scientist repeats before the real work starts.

★ 1.1k stars · Python · Data Tooling

What it does

datacleaner takes a pandas DataFrame and applies a fixed, opinionated cleaning pipeline: optionally drop rows with missing values, fill the remaining NaNs column by column (mode for categorical columns, median for continuous ones), and encode categorical strings as numbers. It exposes both a Python API (autoclean, autoclean_cv) and a CLI for quick one-off jobs. The output is still a plain DataFrame, so you keep full access to pandas operations afterward.
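
To make the three steps concrete, here is a minimal sketch in plain pandas. It is not datacleaner's own code: the helper name clean_sketch and the sample data are hypothetical, and cat.codes stands in for whatever encoder the library applies.

```python
import pandas as pd


def clean_sketch(df: pd.DataFrame, drop_nans: bool = False) -> pd.DataFrame:
    """Hypothetical stand-in for the three steps; not datacleaner's own code."""
    out = df.copy()
    if drop_nans:
        # Step 1 (optional): drop every row that has a missing value.
        out = out.dropna()
    for col in out.columns:
        numeric = pd.api.types.is_numeric_dtype(out[col])
        if out[col].isna().any():
            # Step 2: median for numeric columns, mode for everything else.
            fill = out[col].median() if numeric else out[col].mode().iloc[0]
            out[col] = out[col].fillna(fill)
        if not numeric:
            # Step 3: map each distinct string to an integer code.
            out[col] = out[col].astype("category").cat.codes
    return out


raw = pd.DataFrame({
    "age": [31.0, None, 45.0, 27.0],
    "city": ["Oslo", "Lima", None, "Lima"],
})
cleaned = clean_sketch(raw)
print(cleaned)
```

The result is an ordinary DataFrame with no missing values and only numeric columns, which is the shape of output the description above promises.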

The interesting bit

The autoclean_cv variant is the quiet win: it learns imputation and encoding parameters from the training set only, then applies them to both splits. That prevents the classic cross-validation leak where your test set whispers its medians to the model ahead of time. For a tool this small, that level of care is notable.
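
The leak-free idea can be sketched in a few lines of plain pandas. This covers imputation only and is an illustration of the principle, not autoclean_cv's implementation; fit_fill_values and apply_fill_values are hypothetical helper names, and the income figures are made up.

```python
import pandas as pd


def fit_fill_values(train: pd.DataFrame) -> dict:
    """Learn one fill value per column from the training split only."""
    values = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            values[col] = train[col].median()
        else:
            values[col] = train[col].mode().iloc[0]
    return values


def apply_fill_values(df: pd.DataFrame, values: dict) -> pd.DataFrame:
    """Fill missing entries with values that were learned elsewhere."""
    return df.fillna(value=values)


train = pd.DataFrame({"income": [30.0, 40.0, None, 50.0]})
test = pd.DataFrame({"income": [1000.0, None]})

learned = fit_fill_values(train)            # the test split is never inspected
train_clean = apply_fill_values(train, learned)
test_clean = apply_fill_values(test, learned)
print(test_clean)
```

The missing test value is filled with the training median (40.0), so the outlier in the test split never influences the fill value.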

Key highlights

CLI and Python API both supported; pip-installable
autoclean_cv explicitly guards against information leakage between train and test sets
Accepts a custom category_encoders transformer in place of the default LabelEncoder
Works directly on pandas DataFrames without wrapping or hiding them
MIT licensed and citable via Zenodo DOI

Caveats

The README is upfront that this “is not magic”: it will not parse unstructured text or rescue malformed files
Feature set is deliberately narrow; the authors note they “plan to add more cleaning features as the project grows,” so check whether current capabilities match your needs
Default input/output separator is tab (\t), so comma-separated files need the separator set explicitly
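
The separator caveat is easy to reproduce with pandas alone. The sketch below assumes the tool reads input roughly the way pandas.read_csv does with a tab separator; that is an assumption about its behavior, and the sample text is made up.

```python
import io

import pandas as pd

csv_text = "age,city\n31,Oslo\n45,Lima\n"

# Parsed with a tab separator, each comma-separated row stays a single field.
as_tab = pd.read_csv(io.StringIO(csv_text), sep="\t")

# Parsed with the matching separator, the columns split as intended.
as_comma = pd.read_csv(io.StringIO(csv_text), sep=",")

print(list(as_tab.columns))
print(list(as_comma.columns))
```

If a cleaned file comes back with everything collapsed into one column, the separator is the first thing to check.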

Verdict

Worth a look if you train enough models to be bored of writing the same preprocessing scaffolding. Skip it if you need heavy-duty parsing, outlier detection, or feature engineering — this is strictly the first five minutes of a notebook, automated.
