dformoso/sklearn-classification

A census-taker's notebook: income prediction, step by step

A single Jupyter notebook that walks through the full sklearn pipeline on a classic dataset, with a Docker one-liner to get you running.

692 stars · Jupyter Notebook · Learning · ML Frameworks
Velocity (7d): +0.2 ★/day · Trend: steady

What it does

This repo is one long Jupyter notebook that predicts whether someone earns more than $50K/year using the UCI Census Income dataset. It runs through the standard sklearn hits (exploration, imputation, encoding, feature ranking), then trains models with both sklearn and TensorFlow. A companion mindmap, kept in a separate repo, maps out the broader data science workflow.
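The imputation, encoding, and training steps described above can be sketched roughly as follows. This is not the notebook's code: the data is a tiny synthetic stand-in, the column names (`age`, `hours_per_week`, `workclass`) and the labelling rule are made up for illustration, and logistic regression is just one plausible model choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the census data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "hours_per_week": rng.integers(10, 60, size=n).astype(float),
    "workclass": rng.choice(["Private", "Self-emp", "Gov"], size=n),
})
# Made-up labelling rule so the model has a signal to learn.
y = ((df["age"] > 40) & (df["hours_per_week"] > 35)).astype(int)

# Blank out a few values so the imputers have something to do.
df.loc[df.index[::15], "age"] = np.nan
df.loc[df.index[::20], "workclass"] = np.nan

numeric = ["age", "hours_per_week"]
categorical = ["workclass"]

# Numeric columns: median imputation, then scaling.
# Categorical columns: most-frequent imputation, then one-hot encoding.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0, stratify=y
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Wrapping the preprocessing and the classifier in one `Pipeline` means the imputers and encoder are fitted on the training split only, which avoids leaking test-set statistics into the model.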

The interesting bit

The Docker setup is almost aggressively simple: one docker run command and you’re at localhost:8888 with TensorFlow and Jupyter ready. In a field where environment setup can eat half a day, that’s not nothing.

Key highlights

  • Covers the full pipeline: univariate/bivariate exploration, imputation, selection, encoding, PCA, and model comparison
  • Includes ROC curves and metric calculations (accuracy, precision, recall, f1) for algorithm comparison
  • Designed to run on the official jupyter/tensorflow-notebook Docker image
  • Companion mindmap/cheatsheet at dformoso/machine-learning-mindmap
  • Its 692 stars suggest it has served as a reference for others learning the sklearn workflow
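For readers unfamiliar with the four comparison metrics named in the highlights, here is a minimal sketch of how they fall out of the confusion-matrix counts. It is written in plain Python for clarity; the function name `classification_metrics` and the sample labels are illustrative and do not come from the notebook, which presumably relies on sklearn's own metric helpers.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 1 means "earns more than $50K", 0 means "does not".
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On an imbalanced target like income brackets, accuracy alone can flatter a model that mostly predicts the majority class, which is why precision, recall, and F1 are worth reporting side by side.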

Caveats

  • The notebook is a walkthrough, not a library: expect copy-paste learning, not import-and-go code
  • No requirements.txt or pip install instructions; Docker is the only documented path
  • TensorFlow usage is mentioned but not detailed in the README; unclear if it’s a full alternative pipeline or a brief add-on

Verdict

Good for someone who wants to see a complete, documented sklearn workflow in one place and prefers learning by running cells. Skip if you need modular, reusable code or are already comfortable building pipelines from scratch.
