Object detection that reads English
A vision-language detector that finds objects in images using plain text prompts, no retraining required.

What it does
Grounding DINO takes an image and a text prompt—say, “two dogs with a stick”—and returns bounding boxes for the objects you described. It was trained to align visual features with language, so it generalizes to categories it never saw during training. The repo provides PyTorch inference code and pretrained checkpoints.
The interesting bit
The model outputs 900 candidate boxes per image, then scores each box against every word in your prompt. You threshold both the box confidence and the text-word similarity to decide what counts as a match. This means you can fish for specific phrases (“dogs” but not “stick”) without retraining the network.
Key highlights
- Zero-shot COCO performance of 52.5 AP without training on COCO; 63.0 AP with fine-tuning
- CPU-only inference supported since March 2023
- Hugging Face integration available; also Colab demos and a web demo
- Ecosystem includes Grounded SAM (segmentation), Stable Diffusion image editing, and GLIGEN integration
- Training code not yet released (still on the TODO list)
Caveats
- Installation is finicky: missing
CUDA_HOMEcauses a crypticNameError: name '_C' is not definedthat requires a full reinstall - The README warns that tokenizer behavior varies—word count ≠ token count, and category names should be separated by periods for best results
Verdict
Worth a look if you need flexible, promptable detection without curating a fixed class list. Skip it if you need training code today or want a polished pip-install experience.