Meta's vision model that sees forests and street markets without fine-tuning
DINOv3 is a family of self-supervised vision backbones designed to produce high-quality dense features for everything from semantic segmentation to satellite canopy mapping, often beating task-specialized models out of the box.

What it does
DINOv3 provides pretrained vision transformers and ConvNeXt backbones that output dense, high-resolution features for images. The models range from 21M to 6.7B parameters (the largest is the one billed as 7B) and are trained on either web-scale data (LVD-1689M) or satellite imagery (SAT-493M). Meta ships reference PyTorch code plus adapters for linear probing on tasks like semantic segmentation (ADE20K), depth estimation (NYUv2-Depth), and canopy height mapping.
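To make "dense features" concrete, here is a minimal sketch of how a ViT-style backbone's flat token sequence maps back onto a 2D feature grid. The patch size of 16 is an assumption for illustration (the text above does not state it), the number of prefix tokens is left as a parameter because it varies by model, and both function names are hypothetical.

```python
def dense_feature_grid(height, width, patch_size=16):
    """Return (rows, cols) of the patch-feature grid for an image.

    patch_size=16 is an assumed default, not a value taken from the summary.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    return height // patch_size, width // patch_size


def tokens_to_grid(tokens, rows, cols, num_prefix_tokens=0):
    """Reshape a flat token list into a rows x cols grid of patch features.

    num_prefix_tokens skips any non-patch tokens (for example a class token)
    that a model may prepend to the sequence.
    """
    patches = tokens[num_prefix_tokens:]
    if len(patches) != rows * cols:
        raise ValueError("token count does not match the requested grid")
    return [patches[r * cols:(r + 1) * cols] for r in range(rows)]


rows, cols = dense_feature_grid(224, 224)
print(rows, cols)
```

Each cell of the grid holds one feature vector, which is what a segmentation or depth head would consume per location.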
The interesting bit
The pitch is “without fine-tuning” — these are foundation models in the original sense, meant to work as frozen feature extractors. The CHMv2 release is a nice flex: a 7B-parameter ViT pretrained on satellite data, repurposed for global forest canopy height mapping, with weights on Hugging Face and integration into the Transformers library.
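The frozen-extractor idea can be sketched as a linear probe: the backbone's outputs are treated as fixed inputs and only a small linear head is trained on top. The toy below is a pure-Python logistic-regression probe, not Meta's released code. The feature vectors are hypothetical stand-ins for what a frozen backbone would produce, and the learning rate and step count are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features, labels, lr=0.5, steps=500):
    """Fit weights and bias with full-batch gradient descent.

    The features are never modified; only the linear head (w, b) is learned.
    """
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical 2-D "frozen features" for two classes (illustrative values only).
FEATURES = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 0.9), (0.8, 1.1), (0.9, 0.7)]
LABELS = [0, 0, 0, 1, 1, 1]

w, b = train_linear_probe(FEATURES, LABELS)
accuracy = sum(predict(w, b, x) == y for x, y in zip(FEATURES, LABELS)) / len(LABELS)
print(accuracy)
```

In practice the features would be much higher-dimensional and the head trained with a framework like PyTorch, but the division of labor is the same: a fixed backbone and a cheap trainable head.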
Key highlights
- ViT and ConvNeXt variants from tiny (21M) to 7B parameters, all trained with self-supervised distillation
- Two pretraining domains: general web images (LVD-1689M) and satellite imagery (SAT-493M)
- Supported by PyTorch Hub, Hugging Face Transformers (≥4.56.0), and timm (≥1.0.20)
- Released task code: linear segmentation, depth estimation, and canopy height inference
- Model weights require an access request via Meta’s download portal; wget is recommended over browser downloads
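The version floors in the list above are easy to check before trying to load a model. This is a small sketch using only the standard library. The helper names are hypothetical and the version parsing is deliberately naive: it reads leading integers only, so a pre-release such as "4.56.0rc1" is treated like "4.56.0".

```python
from importlib import metadata

# Minimum versions quoted in the highlights above.
MIN_VERSIONS = {"transformers": "4.56.0", "timm": "1.0.20"}


def parse_version(text):
    """Turn '4.56.0' or '4.57.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare numerically, padding with zeros so '4.56' equals '4.56.0'."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Map each package to True, False, or None when it is not installed."""
    report = {}
    for name, minimum in MIN_VERSIONS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


print(check_environment())
```

A `None` entry just means that route is unavailable in the current environment; only one of the listed libraries is needed.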
Caveats
- Weight downloads are gated behind a request form, not directly fetchable
- The README is heavy on model release announcements and light on training methodology or architecture details
Verdict
Worth a look if you need strong frozen visual features and don’t want to fine-tune a CLIP variant. Skip if you were hoping for an open-weights, no-gatekeeping drop-in replacement — the access friction is real.