A Chinese-language field guide to video object detection
A curated survey repo that explains why single-frame detection fails on video—and how temporal context fixes it.

What it does
This repo is a living literature review of video object detection, written in Chinese. It collects papers, datasets, and implementation notes, organized around one core insight: video provides temporal context that still-image detectors ignore.
The interesting bit
The author doesn’t just list papers; they explain the engineering trade-offs. Two camps emerge. One uses motion information (optical flow, tubelet rescoring) to speed up detection by reusing features across frames. The other fuses temporal context to improve accuracy when objects are blurry, occluded, or at awkward scales. The MSRA work on flow-guided feature warping is singled out as “cleaner” and as the only end-to-end trainable approach at the time.
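To make the feature-reuse idea concrete, here is a minimal NumPy sketch of flow-guided warping: features computed once on a key frame are bilinearly resampled along a flow field to approximate the features of a nearby frame. This is an illustration only, not the MSRA implementation. The function name `warp_features`, the array layouts, and the flow convention (displacements in feature-map pixels, pointing from each current-frame location to its source in the key frame) are assumptions, and the flow itself is taken as given rather than estimated.

```python
import numpy as np


def warp_features(key_feat, flow):
    """Resample key-frame features along a flow field (hypothetical helper).

    key_feat: array of shape (C, H, W), features from the key frame.
    flow:     array of shape (2, H, W); flow[0] is the x displacement and
              flow[1] the y displacement, in feature-map pixels.
    Returns an array of shape (C, H, W) approximating the current frame.
    """
    _, h, w = key_feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Source coordinates in the key frame, clamped to the feature map.
    src_x = np.clip(xs + flow[0], 0, w - 1)
    src_y = np.clip(ys + flow[1], 0, h - 1)

    # Integer corners surrounding each source coordinate.
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets act as bilinear interpolation weights.
    wx = src_x - x0
    wy = src_y - y0

    top = key_feat[:, y0, x0] * (1 - wx) + key_feat[:, y0, x1] * wx
    bottom = key_feat[:, y1, x0] * (1 - wx) + key_feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


# Usage: with zero flow the key-frame features come back unchanged.
feat = np.random.rand(8, 16, 16)
same = warp_features(feat, np.zeros((2, 16, 16)))
```

The speed-up comes from skipping the expensive feature extractor on non-key frames; only the flow and this cheap resampling step run there. Because bilinear sampling is differentiable, an operation of this kind can sit inside a network trained end to end.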
Key highlights
- Surveys CUHK and MSRA research lines with enough detail to grasp the methodological differences
- Catalogs video detection datasets: ImageNet VID, YouTube-Objects, YouTube-BoundingBoxes
- Includes practical tooling: mAP evaluation references, Faster R-CNN and Cascade R-CNN links
- Explicitly flags Seq-NMS as a small, self-contained module worth trying first
- Links to a Zhihu discussion that frames the field better than most English intros
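For readers wondering why Seq-NMS counts as small and self-contained, the following pure-Python sketch shows its core step: link detections in adjacent frames whose boxes overlap, find the highest-scoring linked sequence with dynamic programming, and rescore the boxes on that sequence. It is a simplified illustration under stated assumptions, not code from the repo. The names `iou`, `best_sequence`, and `rescore` are hypothetical, scores are assumed positive, the 0.5 linking threshold is an arbitrary choice, the mean is used for rescoring, and the suppress-and-repeat loop of the full method is omitted.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_sequence(frames, link_iou=0.5):
    """Return the highest-scoring linked path as [(frame_idx, det_idx), ...].

    frames: list with one entry per frame; each entry is a list of
            (box, score) detections. link_iou is an assumed threshold.
    """
    best = []  # best[t][i] = (best cumulative score ending here, previous det index)
    for t, dets in enumerate(frames):
        row = []
        for box, score in dets:
            total, prev = score, None
            if t > 0:
                for j, (prev_box, _) in enumerate(frames[t - 1]):
                    cand = best[t - 1][j][0] + score
                    if iou(box, prev_box) >= link_iou and cand > total:
                        total, prev = cand, j
            row.append((total, prev))
        best.append(row)

    # Pick the detection where the best path ends, then walk backwards.
    ends = [(t, i) for t, row in enumerate(best) for i in range(len(row))]
    end = max(ends, key=lambda ti: best[ti[0]][ti[1]][0], default=None)
    path = []
    while end is not None:
        t, i = end
        path.append((t, i))
        prev = best[t][i][1]
        end = (t - 1, prev) if prev is not None else None
    return path[::-1]


def rescore(frames, path):
    """Map each detection on the path to the mean score of the path."""
    if not path:
        return {}
    mean = sum(frames[t][i][1] for t, i in path) / len(path)
    return {key: mean for key in path}


# Usage: a weak middle-frame detection is lifted by its confident neighbours.
frames = [
    [((0, 0, 10, 10), 0.9)],
    [((1, 0, 11, 10), 0.3), ((50, 50, 60, 60), 0.5)],
    [((2, 0, 12, 10), 0.8)],
]
path = best_sequence(frames)
new_scores = rescore(frames, path)
```

Because this runs purely on the per-frame boxes and scores a still-image detector already produces, it can be bolted on as post-processing without retraining, which is what makes it a reasonable first experiment.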
Caveats
- No original code implementations here—this is a reading list and note collection, not a framework
- README is mostly prose and paper titles; some sections trail off with “TODO” or broken image links
- Several dataset description paragraphs are truncated mid-sentence
Verdict
Worth bookmarking if you’re entering video detection and read Chinese; treat it as a curated syllabus, not a codebase. Skip it if you need runnable code or English-only resources.