Teaching neural nets to see like 1981, but with more CUDA
A ConvNet that learns to estimate depth and camera motion from just two images, replacing classical geometry with learned intuition.

What it does
DeMoN takes two photographs of the same scene and predicts two things: how far away everything is (depth), and how the camera moved between shots. It’s the classic “structure from motion” two-view problem, but solved with a convolutional network rather than the hand-tuned geometric pipelines of old.
The interesting bit
The authors cheekily borrow their subtitle from a 1981 paper by H. C. Longuet-Higgins — the foundational work on two-view reconstruction — then note that their network learns those “complex geometric relations” implicitly from data. The old guard did it with epipolar geometry and matrix decompositions; DeMoN does it with backprop and enough training data.
Key highlights
- Estimates both depth maps and relative camera pose (rotation + translation) jointly
- Published at CVPR 2017 by the Freiburg computer vision lab
- Ships with Docker support and a pretrained weights download script
- Includes evaluation code and (work-in-progress) TensorFlow training code ported from the original Caffe implementation
- Depends on a custom ops library (
lmbspecialops) included as a submodule
Caveats
- Dependency stack is frozen in 2017 amber: TensorFlow 1.4, CUDA 8, Python 3.5, VTK 7.1
- VTK’s python3 interface requires building from source or hunting through Anaconda; the binary release won’t cut it
- The TensorFlow training code is explicitly noted as “work in progress” and not identical to the original Caffe model
- Dataset download scripts had a bug where
rgbd-prefixed files leaked test samples; fixed versions usergbd_bugfixprefix
Verdict
Worth a look if you’re researching learned multi-view geometry or need a baseline for depth-from-stereo-with-motion. Skip it if you need production-ready code on modern TensorFlow — this is a research artifact with the dependency scars to prove it.