juliandewit/kaggle_ndsb2017

A $100K bug that won silver

Second-place Kaggle lung cancer code, warts and all — including the overlooked data subset the author left in for reproducibility.

626 stars · Python · Computer Vision · Domain Apps
Velocity (7d): +0.2 ★/day · Trend: steady

What it does

This is the 2nd-place solution to the 2017 National Data Science Bowl, a lung cancer detection competition on Kaggle. It chains together nodule detectors, malignancy regressors, a U-net mass segmenter, and XGBoost blending to predict cancer probability from 3D CT scans. The pipeline runs from DICOM preprocessing through neural net training to a four-part averaged final submission.
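To make the preprocessing step concrete: the repo's scripts rescale scans to 1 mm isotropic voxels before slicing. The sketch below only plans that resampling, computing per-axis zoom factors and the resulting grid shape from a scan's voxel spacing. The function name `isotropic_resample_plan` and the example scan dimensions are assumptions for illustration, not taken from the repo, and the interpolation itself is left to whatever library is in use.

```python
from typing import Sequence, Tuple


def isotropic_resample_plan(
    shape: Sequence[int],
    spacing_mm: Sequence[float],
    target_mm: float = 1.0,
) -> Tuple[Tuple[float, ...], Tuple[int, ...]]:
    """Return per-axis zoom factors and the voxel grid shape after resampling.

    Hypothetical helper: it plans the resampling only and does not touch
    pixel data.
    """
    if len(shape) != len(spacing_mm):
        raise ValueError("shape and spacing_mm must have the same length")
    # Physical extent along an axis is voxels * spacing; dividing by the
    # target spacing gives the voxel count after resampling.
    new_shape = tuple(
        max(1, round(n * s / target_mm)) for n, s in zip(shape, spacing_mm)
    )
    # Derive the factors from the rounded shape so they match the output grid.
    zoom = tuple(m / n for m, n in zip(new_shape, shape))
    return zoom, new_shape


# Made-up scan: 200 slices of 512x512 pixels, 1.25 mm between slices,
# 0.7 mm in-plane pixel spacing.
zoom, new_shape = isotropic_resample_plan((200, 512, 512), (1.25, 0.7, 0.7))
print(new_shape)
```

The zoom factors could then be handed to an interpolation routine such as `scipy.ndimage.zoom`; whether the repo does it that way is not stated here.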

The interesting bit

The author deliberately preserved a bug that skipped 10% of the LUNA16 training data (“a 100.000 dollar mistake”) because fixing it would break reproducibility of the silver-medal result. That trade-off between correct code and an exactly reproducible result is unusually visible here. The solution is also a genuine team effort: this repo covers two of the four blended models, with teammate Daniel Hammack’s contributions linked separately.

Key highlights

  • 3D convnets for nodule detection/malignancy regression, trained on LUNA16 plus LIDC malignancy labels
  • U-net mass segmenter for suspicious non-nodule tissue, trained on manual annotations
  • ~10 hours training per nodule detector, ~8 hours total for 3-fold mass segmenter
  • Pretrained models available via direct download; preprocessing scripts generate 1mm-isotropic PNG slices and lung masks
  • Final blend: simple average of four model families (two from this repo, two from Hammack)
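The final blend in the last bullet is simple enough to sketch. The snippet below averages per-patient cancer probabilities across several submissions and refuses to blend if the patient sets differ. It assumes each submission is already loaded as a mapping from patient id to probability; the function name, the ids, and the numbers are invented for illustration and are not the repo's code or data.

```python
from statistics import fmean
from typing import Dict, Sequence


def average_submissions(
    submissions: Sequence[Dict[str, float]],
) -> Dict[str, float]:
    """Blend submissions by taking the unweighted mean per patient."""
    if not submissions:
        raise ValueError("need at least one submission")
    ids = set(submissions[0])
    for sub in submissions[1:]:
        # A blend is only meaningful if every model scored the same patients.
        if set(sub) != ids:
            raise ValueError("submissions cover different patients")
    return {pid: fmean(sub[pid] for sub in submissions) for pid in sorted(ids)}


# Four made-up model outputs for two hypothetical patients.
models = [
    {"patient_a": 0.25, "patient_b": 0.10},
    {"patient_a": 0.50, "patient_b": 0.20},
    {"patient_a": 0.75, "patient_b": 0.30},
    {"patient_a": 0.50, "patient_b": 0.40},
]
blend = average_submissions(models)
print(blend)
```

An unweighted mean has no parameters to fit, so the blend itself cannot overfit to a validation set.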

Caveats

  • Targets Windows 64-bit with a specific dependency stack (Keras/TensorFlow, SimpleITK, pydicom, XGBoost, OpenCV)
  • Requires manual labels, generated labels, and teammate’s submission files in the resources folder
  • Raw patient data must be fetched separately from Kaggle and LUNA16 websites
  • Some “pieces could be a bit cleaner”; author admits to leaving in bugs found during cleanup

Verdict

Worth studying if you’re building medical imaging pipelines or care about how competition code actually looks — messy, branched, and dependent on manual orchestration. Skip it if you want a turnkey lung cancer classifier; this is archival research code, not a product.
