A 2017 master’s thesis that glued YOLO and TensorBox to Inception
Video object tracking built from off-the-shelf parts for the ImageNet VID competition, with the rough edges left visible.

What it does
This is a master’s thesis project for video object tracking in TensorFlow, built to compete in ImageNet’s VID challenge. It chains together existing open-source implementations: YOLO or TensorBox for detection, hand-rolled post-processing for temporal smoothing, and Inception for classification. You feed it a video file; it spits out an annotated MP4.
The interesting bit
The architecture explicitly copies the T-CNN paper’s cascade strategy (detection first, then temporal information, then context) but replaces the trainable temporal components with non-trainable post-processing algorithms in Utils_Video.py. It’s a Frankenstein that admits it’s a Frankenstein.
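To make “non-trainable post-processing” concrete, here is a minimal sketch of what such a step can look like: fill in missed detections by interpolation, then average box coordinates over neighbouring frames. This is not the code in Utils_Video.py; the function names `fill_gaps` and `smooth`, the single-object track format, and the window size are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical track format: one box per frame, None where the detector missed.
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def fill_gaps(track: List[Optional[Box]]) -> List[Optional[Box]]:
    """Linearly interpolate missing boxes that sit between two known boxes.

    Leading and trailing gaps are left as None because there is nothing
    to interpolate from on one side.
    """
    filled = list(track)
    known = [i for i, box in enumerate(track) if box is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            t = (i - left) / span
            filled[i] = tuple(
                (1 - t) * a + t * b for a, b in zip(track[left], track[right])
            )
    return filled


def smooth(track: List[Optional[Box]], window: int = 3) -> List[Optional[Box]]:
    """Centered moving average per coordinate; None entries are skipped."""
    half = window // 2
    out: List[Optional[Box]] = []
    for i, box in enumerate(track):
        if box is None:
            out.append(None)
            continue
        lo, hi = max(0, i - half), i + half + 1
        neighbours = [b for b in track[lo:hi] if b is not None]
        out.append(tuple(sum(c) / len(neighbours) for c in zip(*neighbours)))
    return out


if __name__ == "__main__":
    raw = [(0, 0, 10, 10), None, (20, 20, 30, 30)]
    print(smooth(fill_gaps(raw)))
```

Nothing here has weights or a loss function, which is the trade-off the README’s caps are pointing at: it is cheap and needs no training data, but it cannot learn motion patterns the way T-CNN’s trainable temporal components can.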
Key highlights
- Supports YOLO (single-class detection) and TensorBox (multi-class) pipelines
- Includes dataset preprocessing scripts with brute-force and lightweight modes
- Provides trained weights for both Inception and TensorBox via Mega.nz links
- Hardcoded to 640×480 PNG for TensorBox; you’ll need to hack the resize scripts for other dimensions
- Requires OpenCV, TensorFlow, and Python; installation guide included
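One practical consequence of the 640×480 constraint: if your source video has a different resolution, boxes predicted on the resized frames have to be mapped back before you draw them. A rough sketch of that mapping follows, assuming frames were stretched to 640×480 without letterboxing (the review does not say how the repo’s resize scripts handle aspect ratio). The helper name `to_source_coords` is hypothetical and does not come from the project.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

# Fixed detector input size mentioned for the TensorBox pipeline.
DET_W, DET_H = 640, 480


def to_source_coords(box: Box, src_w: int, src_h: int) -> Box:
    """Scale a box from detector space (640x480) back to the source frame.

    Assumes a plain stretch resize, so x and y scale independently.
    """
    sx = src_w / DET_W
    sy = src_h / DET_H
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


if __name__ == "__main__":
    # A box covering the lower-right quadrant of the detector input,
    # mapped back onto a 1280x720 source frame.
    print(to_source_coords((320, 240, 640, 480), 1280, 720))
```

If the resize preserved aspect ratio with padding instead, the padding offset would need to be subtracted before scaling, so check the preprocessing scripts before relying on this.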
Caveats
- Temporal information is “retrieved through some post processing algorithm… NOT TRAINABLE” — the README’s own caps
- Weight download links are from 2017 (Mega.nz); longevity unclear
- Early output had frame-ordering bugs causing flicker; author notes this was “then solved” but the fix isn’t detailed
- The README’s promise, “I will soon put a weight file to download,” was still unfulfilled as of the last update (10-03-2017)
Verdict
Worth a look if you’re studying how to bolt together 2016-era detection models for video, or if you need a concrete (if dated) reference for the ImageNet VID pipeline. Skip it if you want a maintained, trainable end-to-end tracker—this is explicitly thesis-grade glue code with the scaffolding still showing.