Ctrl+F for video: find moments without timestamps or tags
A single Jupyter notebook that lets you grep through YouTube videos with plain English instead of scrubbing timelines.

What it does
Downloads a YouTube video, samples frames at intervals, and encodes them with OpenAI’s CLIP model. You type a query like “a fire truck” or “waiting at the red light”; the notebook ranks frames by semantic similarity to the query and returns the best matches. No manual tagging, no transcript required.
The interesting bit
The trick is treating video search as image retrieval over brute-force sampled frames. CLIP’s joint image-text embedding makes the matching possible; the author’s insight was simply applying it to uniformly spaced frames and wrapping the whole pipeline in a runnable Colab notebook.
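As a rough illustration of the ranking step, the sketch below scores frames against a query by cosine similarity. It assumes the frame and query embeddings have already been computed; the tiny hand-written vectors merely stand in for CLIP outputs, and `cosine` and `rank_frames` are hypothetical helper names, not functions taken from the notebook.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_frames(query_vec, frame_vecs, top_k=3):
    """Return (frame_index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(frame_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-d vectors standing in for CLIP embeddings (real ones are much larger).
frame_vecs = [
    [1.0, 0.0, 0.0],  # frame 0
    [0.6, 0.8, 0.0],  # frame 1
    [0.0, 0.0, 1.0],  # frame 2
]
query_vec = [0.0, 1.0, 0.0]  # stand-in for the encoded text query

best = rank_frames(query_vec, frame_vecs, top_k=2)
print(best)  # frame 1 comes first because it points closest to the query
```

In a real run, each frame index would map back to a timestamp in the video, which is what turns an image-retrieval result into a "moment".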
Key highlights
- Single .ipynb file: run directly in Google Colab, no local setup
- Uses CLIP for zero-shot search: no training on your video needed
- Hugging Face Spaces + Gradio demo available (community integration)
- Companion project extends the same approach to 2M Unsplash photos
- Example queries show surprising specificity: “The Embarcadero,” “green bike lane,” “Transamerica Pyramid”
Caveats
- Frame sampling is fixed-interval; brief events between sampled frames are missed
- README doesn’t specify sampling rate, compute requirements, or runtime for long videos
- Requires downloading the full video before analysis; no streaming or incremental processing
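To make the first caveat concrete, here is a small sketch of fixed-interval sampling. The 2-second interval and 10-minute duration are made-up values (the README states neither), and both function names are hypothetical.

```python
def sample_timestamps(duration_s, interval_s):
    """Timestamps (seconds) of frames taken every interval_s, starting at 0."""
    count = int(duration_s // interval_s) + 1
    return [i * interval_s for i in range(count)]

def is_captured(event_start_s, event_end_s, timestamps):
    """True if at least one sampled frame falls inside the event."""
    return any(event_start_s <= t <= event_end_s for t in timestamps)

# Hypothetical numbers: a 10-minute video sampled every 2 seconds.
stamps = sample_timestamps(duration_s=600, interval_s=2)
print(len(stamps))                      # number of frames to encode
print(is_captured(10.2, 11.5, stamps))  # event between the 10 s and 12 s samples
print(is_captured(10.2, 12.5, stamps))  # event that spans the 12 s sample
```

With these numbers, 301 frames are encoded, the first event is missed, and the second is caught. Shortening the interval catches briefer events at the cost of proportionally more frames to encode.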
Verdict
Useful for researchers, journalists, or hobbyists who need to locate visual moments in long videos without watching them. Not a production system, but a clever proof-of-concept that happens to work well enough for demos.