TensorFlow CVPR paper, rewritten for PyTorch die-hards
A faithful PyTorch port of SfMLearner that trains faster and finally lets you validate against ground truth.

What it does: Takes monocular video and learns to estimate depth maps and camera ego-motion simultaneously, with no depth labels required. It is a PyTorch reimplementation of the CVPR 2017 oral paper by Zhou et al., plus training and evaluation scripts for KITTI and Cityscapes.
The interesting bit: The author didn’t just transliterate TensorFlow to PyTorch. He rethought the data pipeline so frame stacking happens on the fly rather than ahead of time, which cut training step time from ~0.20 s to ~0.14 s on a GTX 980 Ti. He also found, empirically and somewhat mysteriously, that applying the smoothness loss to depth instead of disparity makes the network converge, whereas the official formulation did not.
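To make "on the fly" concrete, here is a minimal sketch of the idea, not the repository's actual loader. The function names `build_samples` and `load_sample` are hypothetical. The point is that a sample is just a set of indices, so changing the sequence length needs no re-run of preprocessing.

```python
from typing import List, Sequence, Tuple


def build_samples(num_frames: int, sequence_length: int = 3) -> List[Tuple[int, List[int]]]:
    """Return (target_index, reference_indices) pairs for one video.

    Nothing is stacked or written to disk here: the sequence length only
    changes which indices get paired, so it can differ between runs.
    """
    if sequence_length < 3 or sequence_length % 2 == 0:
        raise ValueError("sequence_length must be odd and >= 3")
    half = (sequence_length - 1) // 2
    samples = []
    # Only frames with enough neighbours on both sides can be targets.
    for tgt in range(half, num_frames - half):
        refs = [tgt + shift for shift in range(-half, half + 1) if shift != 0]
        samples.append((tgt, refs))
    return samples


def load_sample(frames: Sequence, sample: Tuple[int, List[int]]):
    """Fetch the target frame and its neighbours at access time."""
    tgt, refs = sample
    return frames[tgt], [frames[i] for i in refs]


# Stand-ins for image files; a real loader would read and decode them here.
frames = [f"frame_{i:03d}.jpg" for i in range(6)]
samples = build_samples(len(frames), sequence_length=3)
target, references = load_sample(frames, samples[0])
```

With six frames and a sequence length of 3, the first sample pairs frame 1 with frames 0 and 2. Asking for a length of 5 on the same files simply yields different index pairs.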
Key highlights
- On-the-fly sequence stacking eliminates pre-processing for specific sequence lengths.
- Validation can now use actual ground-truth depth, revealing that photometric loss minimization ≠ depth optimization.
- Pretrained weights and KITTI depth/pose benchmarks included (Abs Rel 0.181, ATE 0.0179 on Seq. 09).
- Supports both KITTI Raw and Cityscapes; pose evaluation computes ATE and rotation error explicitly.
- GitHub Issues are used as a surprisingly readable discussion forum on scale ambiguity, static-frame filtering, and inverse-warp direction.
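For readers unfamiliar with the two numbers quoted above, the following is a rough sketch of how Abs Rel and a snippet-level ATE are commonly computed. It is written in NumPy under stated assumptions and is not the repository's evaluation code: invalid ground-truth pixels are assumed to be marked with 0, depth is median-scaled before comparison because of the scale ambiguity, and the ATE uses a least-squares scale fit. The function names are illustrative.

```python
import numpy as np


def abs_rel(gt: np.ndarray, pred: np.ndarray, median_scale: bool = True) -> float:
    """Mean of |gt - pred| / gt over pixels that have ground truth."""
    valid = gt > 0  # assumption: 0 marks pixels without a measurement
    gt, pred = gt[valid], pred[valid]
    if median_scale:
        # Monocular depth is only known up to scale, so align medians first.
        pred = pred * (np.median(gt) / np.median(pred))
    return float(np.mean(np.abs(gt - pred) / gt))


def ate(gt_xyz: np.ndarray, pred_xyz: np.ndarray) -> float:
    """Trajectory error for one snippet of camera positions, shape (N, 3)."""
    # Least-squares scale that best maps predicted positions onto ground truth.
    scale = np.sum(gt_xyz * pred_xyz) / np.sum(pred_xyz ** 2)
    return float(np.linalg.norm(gt_xyz - scale * pred_xyz) / len(gt_xyz))


# Toy data: the prediction is exactly half the true depth everywhere.
gt_depth = np.array([[2.0, 4.0], [0.0, 8.0]])
pred_depth = gt_depth / 2.0
scaled_error = abs_rel(gt_depth, pred_depth)           # scale removed
raw_error = abs_rel(gt_depth, pred_depth, False)       # scale kept
```

In the toy case the scaled error vanishes while the raw error is 0.5. That gap is why reported depth numbers depend on whether scale alignment was applied.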
Caveats
- Requires PyTorch ≥ 1.0.1; older versions need a tagged commit.
- The 2.3 downscaling loss divisor and depth-based smooth loss are “empiric experiments,” not theoretically derived, and trade off pose accuracy for depth accuracy.
- Cityscapes data access requires contacting dataset administrators.
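The two empirical choices named in the caveats, the 2.3 divisor and the depth-based smooth loss, can be illustrated with a short PyTorch sketch. This is an assumption-laden illustration, not the repository's code: the use of second-order differences, the function name `smooth_loss`, and the `divisor` parameter are choices made here for clarity. Only the value 2.3 and the depth-versus-disparity switch come from the text above.

```python
import torch


def smooth_loss(pred_maps, divisor: float = 2.3) -> torch.Tensor:
    """Second-order smoothness penalty summed over a list of scales.

    pred_maps: list of tensors shaped (B, 1, H, W), finest scale first,
    each with H and W of at least 3.
    """
    def gradient(x):
        dy = x[:, :, 1:, :] - x[:, :, :-1, :]   # vertical differences
        dx = x[:, :, :, 1:] - x[:, :, :, :-1]   # horizontal differences
        return dx, dy

    loss = 0.0
    weight = 1.0
    for m in pred_maps:
        dx, dy = gradient(m)
        dx2, dxdy = gradient(dx)
        dydx, dy2 = gradient(dy)
        loss = loss + weight * (dx2.abs().mean() + dxdy.abs().mean()
                                + dydx.abs().mean() + dy2.abs().mean())
        weight /= divisor  # each coarser scale counts less; 2.3 is the empirical value
    return loss


# The same function can penalise either quantity; only the input changes.
disparity = [torch.rand(2, 1, 8, 8) + 0.1, torch.rand(2, 1, 4, 4) + 0.1]
depth = [1.0 / d for d in disparity]
loss_on_disparity = smooth_loss(disparity)
loss_on_depth = smooth_loss(depth)
```

Because depth is the reciprocal of disparity, the two losses weight near and far regions very differently. That may be part of why the choice affects convergence, though the source offers no explanation.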
Verdict: Grab this if you need a working, hackable PyTorch baseline for unsupervised depth-from-video and want to avoid 2017-era TensorFlow tooling. Skip it if you need production-ready SLAM or are allergic to hand-tuned loss coefficients.