Depth estimation that actually runs on your laptop
A monocular depth model that trades Stable Diffusion's bulk for DINOv2's backbone, shipping four sizes from 25M to 1.3B parameters.

What it does
Depth Anything V2 turns a single RGB image into a depth map. No stereo rig, no LiDAR, no fuss. The repo defines four model sizes (Small through Giant, though Giant has not been released yet), plus scripts for batch images, video, and a local Gradio demo. You can also load it via Hugging Face Transformers if you don’t want the full checkout.
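For the Transformers route, a minimal sketch might look like this. It assumes `transformers`, `torch`, and `Pillow` are installed and that the Hub checkpoints follow the `depth-anything/Depth-Anything-V2-<Size>-hf` naming pattern; the helper names `hf_model_id` and `estimate_depth` are made up for illustration and are not part of the repo.

```python
# Sketch: depth estimation through the Hugging Face `pipeline` API.
# Assumption: Hub model IDs look like "depth-anything/Depth-Anything-V2-<Size>-hf".

HF_SIZES = ("Small", "Base", "Large")  # Giant is not released yet


def hf_model_id(size: str) -> str:
    """Build the Hub model ID for a given size (hypothetical helper)."""
    size = size.capitalize()
    if size not in HF_SIZES:
        raise ValueError(f"unknown or unreleased size: {size!r}")
    return f"depth-anything/Depth-Anything-V2-{size}-hf"


def estimate_depth(image_path: str, size: str = "Small"):
    """Return a PIL image holding the predicted depth map."""
    # Imported lazily so hf_model_id() works without the heavy dependencies.
    from PIL import Image
    from transformers import pipeline

    pipe = pipeline(task="depth-estimation", model=hf_model_id(size))
    result = pipe(Image.open(image_path))
    # The pipeline returns a dict; "depth" is a PIL image,
    # "predicted_depth" is the raw tensor.
    return result["depth"]
```

Calling `estimate_depth("photo.jpg").save("depth.png")` would download the Small checkpoint on first use and write a grayscale depth image; `photo.jpg` is a placeholder path.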
The interesting bit
In V1, the authors unintentionally decoded from the last four DINOv2 layers instead of intermediate ones, and V2 switches to intermediate features. They admit the change “did not improve details or accuracy,” which is either refreshing honesty or a testament to how robust the underlying approach already was. The real win is speed: they claim faster inference and fewer parameters than SD-based depth models, with four scales to match your GPU budget.
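To make the layer mix-up concrete, here is a small sketch of the two ways to pick four feature taps from a transformer backbone. The 12-block depth and the evenly spaced tap positions are assumptions chosen for illustration; they are not read from the repository, and the function names are hypothetical.

```python
# Sketch: which transformer blocks feed the depth decoder.
# The tap positions are illustrative, not taken from the repo's code.

def last_n_layers(depth: int, n: int = 4) -> list[int]:
    """Indices of the final n blocks (the V1 behavior described above)."""
    return list(range(depth - n, depth))


def evenly_spaced_layers(depth: int, n: int = 4) -> list[int]:
    """Indices of n blocks spread across the backbone, ending at the last one."""
    return [(i * depth) // n - 1 for i in range(1, n + 1)]


print(last_n_layers(12))         # [8, 9, 10, 11]
print(evenly_spaced_layers(12))  # [2, 5, 8, 11]
```

The first list only sees late, highly abstract features; the second mixes shallow and deep ones, which is the more common choice for dense prediction decoders.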
Key highlights
- Four checkpoints: 24.8M, 97.5M, 335.3M, and 1.3B parameters (Giant “coming soon”)
- Native video support; the larger models give better temporal consistency
- Metric depth fine-tuning available in a separate subdirectory
- Apple Core ML, TensorRT, ONNX, and Android ports already exist in the community
- Small model is Apache-2.0; Base/Large/Giant are CC-BY-NC-4.0 (non-commercial)
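The size and license split above boils down to a small lookup. The sketch below encodes the figures from the list and picks the largest released checkpoint that fits a parameter budget and a licensing constraint; the `Variant` class and `pick_variant` function are hypothetical helpers, not something the repo provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    params_m: float  # parameter count in millions
    license: str
    released: bool


# Figures taken from the highlights above; Giant is listed but unreleased.
VARIANTS = [
    Variant("Small", 24.8, "Apache-2.0", True),
    Variant("Base", 97.5, "CC-BY-NC-4.0", True),
    Variant("Large", 335.3, "CC-BY-NC-4.0", True),
    Variant("Giant", 1300.0, "CC-BY-NC-4.0", False),
]


def pick_variant(max_params_m: float, commercial: bool) -> Optional[Variant]:
    """Largest released variant within the budget and license constraint."""
    candidates = [
        v for v in VARIANTS
        if v.released
        and v.params_m <= max_params_m
        and (not commercial or v.license == "Apache-2.0")
    ]
    return max(candidates, key=lambda v: v.params_m, default=None)
```

With a 500M budget, `pick_variant(500, commercial=False)` lands on Large, while `pick_variant(500, commercial=True)` falls back to Small because it is the only Apache-2.0 checkpoint.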
Caveats
- The 1.3B Giant model is still unreleased
- Hugging Face Transformers integration exists, but predictions differ slightly from the native path because the native code upsamples with OpenCV while the Transformers port uses Pillow
Verdict
Grab this if you need off-the-shelf depth maps without orchestrating a diffusion pipeline. Skip it if you need guaranteed metric accuracy out of the box; that requires fine-tuning or LiDAR prompting (see their Prompt Depth Anything follow-up).