GPT-4 picks the clips, FFmpeg does the cutting
A Jupyter notebook that downloads YouTube videos, asks GPT-4 to find the "viral" bits in the transcript, then crops around faces for vertical shorts.

What it does
Feed it a YouTube URL. It downloads the video, pulls the transcript via youtube-transcript-api, ships that text to GPT-4, and asks the model to identify timestamps worth keeping. FFmpeg then slices those segments, and OpenCV’s face detection tries to center the crop on whoever’s talking. The output is a stack of vertical clips ready for TikTok or Shorts.
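A rough sketch of the transcript step, for orientation only. It assumes transcript entries shaped like dicts with text, start, and duration keys, and it assumes the model is asked to reply with a JSON list of start/end pairs. The helper names format_transcript and parse_segments are hypothetical, and the README does not show the notebook's actual prompt or reply format.

```python
import json

def format_transcript(entries):
    """Render transcript entries as '[start_seconds] text' lines for a prompt."""
    return "\n".join(f"[{e['start']:.1f}] {e['text']}" for e in entries)

def parse_segments(reply, video_length):
    """Parse a JSON reply of [{'start': s, 'end': e}, ...] and drop invalid spans."""
    segments = []
    for item in json.loads(reply):
        start, end = float(item["start"]), float(item["end"])
        # Keep only spans that run forward and stay inside the video.
        if 0 <= start < end <= video_length:
            segments.append((start, end))
    return sorted(segments)

# Hypothetical transcript and model reply, for illustration only.
entries = [
    {"text": "welcome back", "start": 0.0, "duration": 2.5},
    {"text": "here is the trick", "start": 2.5, "duration": 4.0},
]
prompt_body = format_transcript(entries)
reply = '[{"start": 2.5, "end": 6.5}, {"start": 9.0, "end": 4.0}]'
clips = parse_segments(reply, video_length=6.5)
```

Validating the reply before cutting matters because the model only sees text and can return timestamps that run backwards or past the end of the video.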
The interesting bit
The face-detection pipeline doesn’t just find faces. It includes an is_talking_in_batch() function that attempts to detect lip movement or facial muscle activity to decide who to focus on. That’s a nice touch for multi-speaker videos where simple face detection would ping-pong between subjects.
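One way such a speaker check could work, sketched under loud assumptions: the README does not show how is_talking_in_batch() measures movement, so the function below (pick_speaker, a hypothetical name) substitutes a simple heuristic. It takes per-frame mouth heights for each face in a batch and picks the face whose measurements vary the most.

```python
from statistics import pvariance

def pick_speaker(mouth_openings, min_variance=1.0):
    """Return the face id whose mouth opening varies most, or None if all are still.

    mouth_openings maps a face id to a list of per-frame mouth heights in pixels.
    This input shape is an assumption, not the notebook's documented format.
    """
    best_id, best_var = None, min_variance
    for face_id, samples in mouth_openings.items():
        if len(samples) < 2:
            continue  # not enough frames to measure movement
        var = pvariance(samples)
        if var > best_var:
            best_id, best_var = face_id, var
    return best_id

# Hypothetical batch of six frames with two detected faces.
batch = {
    "left": [4, 5, 4, 5, 4, 5],      # nearly still
    "right": [3, 12, 5, 14, 4, 11],  # opening and closing
}
speaker = pick_speaker(batch)
```

Scoring a batch of frames rather than a single frame is what keeps the crop from jumping between subjects on every detection.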
Key highlights
- GPT-4 analyzes transcripts, not raw video—so it’s cheap on tokens but blind to visual action
- Face-aware cropping with FFmpeg, not just center-crop
- Lip-movement detection to guess who’s speaking
- One-shot Jupyter notebook workflow: URL in, clips out
- Explicitly marked WIP with a bug warning in the README
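The face-aware cropping in the highlights can be illustrated with a minimal sketch. It assumes a face center x-coordinate is already available from a detector, and it builds an FFmpeg argument list using the standard -ss, -t, and crop filter options. The function names vertical_crop and ffmpeg_args and the file names are hypothetical; the repo's actual commands are not shown in the README.

```python
def vertical_crop(frame_w, frame_h, face_cx):
    """Return (w, h, x, y) for a 9:16 crop centered on face_cx, clamped to the frame."""
    crop_h = frame_h
    crop_w = int(frame_h * 9 / 16)
    crop_w -= crop_w % 2  # keep the width even, which many encoders expect
    x = int(face_cx - crop_w / 2)
    x = max(0, min(x, frame_w - crop_w))  # do not slide past either edge
    return crop_w, crop_h, x, 0

def ffmpeg_args(src, dst, start, end, crop):
    """Build an FFmpeg command that cuts [start, end] and applies the crop."""
    w, h, x, y = crop
    return [
        "ffmpeg", "-y",
        "-ss", str(start), "-t", str(end - start),
        "-i", src,
        "-vf", f"crop={w}:{h}:{x}:{y}",
        "-c:a", "copy",
        dst,
    ]

# Hypothetical 1080p source with a face detected right of center.
crop = vertical_crop(1920, 1080, face_cx=960)
args = ffmpeg_args("input.mp4", "clip_01.mp4", 2.5, 6.5, crop)
```

The list could be passed to subprocess.run(args, check=True). Clamping x is the part a plain center-crop skips: a face near the frame edge still gets a full-width window instead of a crop that runs off the picture.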
Caveats
- The README admits the “GPT-4 model and transcript analysis functionality… are simulated and not fully functional” without a valid API key
- No actual code shown in the README; you’re trusting the description matches auto_cropper.py
- Requires manual API key injection and hardcoded video_id editing
- “Viral” is whatever GPT-4 thinks it means; no training data or validation shown
Verdict
Worth a look if you’re building a shorts pipeline and want a starting point for transcript-driven editing. Skip it if you need production reliability or visual-aware highlight detection: this is text-first, video-second, and explicitly unfinished.