YOLO on a diet: shrinking detection models by force
A Chinese-language repo that squeezes YOLOv3/v4 through channel pruning, layer pruning, and knowledge distillation for edge deployment.

What it does
This project takes Ultralytics’ YOLOv3/v4 implementation and puts it through a multi-stage compression pipeline. You train normally, then run “sparse training” to crush batch-normalization gamma coefficients toward zero, then prune channels or entire shortcut layers based on those coefficients. A final finetune stage recovers accuracy, optionally guided by knowledge distillation from the original fat model.
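To make the sparse-train-then-prune idea concrete, here is a minimal pure-Python sketch. It assumes the common formulation where an L1 penalty of strength s is added to the loss on each batch-norm gamma, and where channels are kept or dropped against a single global threshold on |gamma|. The function names, the plain-list data layout, and the "keep at least one channel per layer" safeguard are all illustrative assumptions, not the repo's code, which works on PyTorch tensors.

```python
# Hypothetical sketch: names and data layout are illustrative, not from the repo.

def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def sparsity_step(gammas, task_grads, s, lr):
    """One SGD step on batch-norm gammas with an L1 penalty of strength s.

    The subgradient of s * |gamma| is s * sign(gamma); it is added to the
    ordinary task gradient, which nudges every gamma toward zero.
    """
    return [g - lr * (tg + s * sign(g)) for g, tg in zip(gammas, task_grads)]


def channel_masks(layer_gammas, prune_ratio):
    """Build keep/drop masks from one global threshold on |gamma|.

    layer_gammas is a list of per-layer gamma lists. Roughly prune_ratio of
    all channels (the ones with the smallest |gamma|) are marked False.
    """
    all_abs = sorted(abs(g) for layer in layer_gammas for g in layer)
    cut = int(len(all_abs) * prune_ratio)
    threshold = all_abs[cut] if cut < len(all_abs) else float("inf")

    masks = []
    for layer in layer_gammas:
        mask = [abs(g) >= threshold for g in layer]
        if not any(mask):
            # Assumed safeguard: never empty a layer, keep its strongest channel.
            best = max(range(len(layer)), key=lambda i: abs(layer[i]))
            mask[best] = True
        masks.append(mask)
    return masks
```

Calling `channel_masks([[0.9, 0.01, 0.5], [0.02, 0.7]], 0.5)` marks the two near-zero channels for removal and keeps the rest. The better sparse training separates the gammas into "near zero" and "clearly useful", the less accuracy this thresholding step costs.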
The interesting bit
The repo doesn’t pick one pruning strategy: it implements three competing channel-pruning approaches (conservative shortcut-avoiding, mask-sharing, and union-mask) plus a derived layer-pruning strategy that carves out entire shortcut blocks. The author also added two knowledge-distillation strategies: a basic Hinton-style classification distillation and a detection-specific variant where the student only learns from the teacher when the teacher is closer to the ground-truth target than the student is.
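The two distillation flavors are easier to compare side by side. The sketch below is an assumption-laden illustration rather than the repo's loss code: the first function is the textbook Hinton-style soft-target loss (KL divergence between temperature-softened class distributions, scaled by T squared), and the second gates a box-regression term so it only fires when the teacher's box is closer to the ground truth than the student's. The squared-distance metric, the default temperature, and all function names are hypothetical choices made for the example.

```python
import math

# Hypothetical sketch: names, distance metric, and temperature are illustrative.


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature=3.0):
    """Hinton-style loss: KL(teacher || student) on softened distributions.

    Multiplying by temperature**2 keeps gradient magnitudes comparable
    across different temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    return kl * temperature ** 2


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length box vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gated_box_loss(student_box, teacher_box, target_box):
    """Pull the student toward the teacher only when the teacher is better.

    If the teacher's box is no closer to the ground truth than the
    student's, the teacher term contributes nothing.
    """
    if sq_dist(teacher_box, target_box) < sq_dist(student_box, target_box):
        return sq_dist(student_box, teacher_box)
    return 0.0
```

The gate is the whole point of the detection-specific variant: a teacher that is wrong about a particular box stops dragging the student away from the ground truth for that box.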
Key highlights
- Supports YOLOv3, YOLOv3-SPP, YOLOv3-tiny, YOLOv4, and YOLOv4-tiny
- Three sparse-training schedules: constant penalty, global decay at 50% of the epochs, or local decay on the least-important 15% of channels
- Layer pruning removes shortcut blocks (up to 48 of 69 eligible layers in YOLOv3) for speed gains beyond what channel pruning alone achieves
- Knowledge distillation via --t_cfg and --t_weights flags during finetuning
- Mixed-precision training via NVIDIA Apex for faster iteration
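The sparse-training schedules and the layer-pruning step from the list above can be sketched in a few lines. Treat this as a guess at the shape of the logic, not the repo's implementation: the decay factor of 0.01, the assumption that local decay also starts at the halfway point, and the choice to rank shortcut blocks by mean |gamma| are all hypothetical, as are the function and parameter names.

```python
# Hypothetical sketch: decay factor, switch point, and ranking score are assumptions.


def penalty_per_channel(gammas, s, epoch, total_epochs,
                        schedule="constant", decay=0.01, local_fraction=0.15):
    """Return the L1 penalty strength applied to each channel this epoch.

    constant: every channel always gets s.
    global:   from the halfway point on, every channel gets s * decay.
    local:    from the halfway point on, only the local_fraction of channels
              with the smallest |gamma| get s * decay; the rest keep s.
    """
    n = len(gammas)
    past_halfway = epoch >= 0.5 * total_epochs
    if schedule == "constant" or not past_halfway:
        return [s] * n
    if schedule == "global":
        return [s * decay] * n
    if schedule == "local":
        k = int(n * local_fraction)
        by_importance = sorted(range(n), key=lambda i: abs(gammas[i]))
        weakest = set(by_importance[:k])
        return [s * decay if i in weakest else s for i in range(n)]
    raise ValueError(f"unknown schedule: {schedule}")


def shortcut_blocks_to_prune(block_gammas, num_to_prune):
    """Pick the shortcut blocks with the lowest mean |gamma|.

    block_gammas maps a block index to the gammas of the batch-norm layer
    used to score that block. Returns the chosen indices in ascending order.
    """
    scores = {
        idx: sum(abs(g) for g in gs) / len(gs)
        for idx, gs in block_gammas.items()
    }
    weakest_first = sorted(scores, key=scores.get)
    return sorted(weakest_first[:num_to_prune])
```

For example, `shortcut_blocks_to_prune({0: [0.9, 0.8], 1: [0.01, 0.02], 2: [0.5, 0.4], 3: [0.03, 0.01]}, 2)` selects blocks 1 and 3, the two whose gammas sparse training has already pushed close to zero.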
Caveats
- README and code comments are entirely in Chinese; English speakers will need translation help
- Several “strategies” require uncommenting hardcoded lines in source files rather than command-line flags
- The author notes that sparse training is “the top priority” and that finding the right penalty coefficient s takes significant trial and error
Verdict
Worth a look if you’re shipping YOLO to resource-constrained hardware and can invest time in tuning the sparse-training hyperparameters. Skip it if you need a polished, one-command solution or don’t read Chinese.