Skip the audio: this transcriber loots YouTube subtitles first
A self-hosted tool that transcribes and summarizes videos by extracting existing subtitles before ever touching Whisper.

What it does
Drop a YouTube, TikTok, or podcast URL (or a local audio/video file) into a web UI and get back a cleaned-up transcript, optional translation, and an AI summary. The whole thing runs locally as a FastAPI server with a vanilla-JS frontend.
The interesting bit
The “subtitle-first architecture” is the quiet win: for platforms like YouTube that already have captions, it grabs those instantly and skips the audio download and Whisper entirely. Only when no subtitles exist does it fall back to Faster-Whisper on normalized 16 kHz mono audio. For captioned videos, that pipeline choice saves more time than any choice of Whisper model.
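The fallback order described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: `transcribe`, `fetch_subtitles`, `download_audio`, and `run_whisper` are hypothetical names, and the three callables stand in for whatever the tool really uses (presumably yt-dlp and Faster-Whisper). The FFmpeg flags `-ar 16000 -ac 1` are the standard way to request 16 kHz mono output; whether the tool passes exactly these arguments is an assumption.

```python
from typing import Callable, List, Optional, Tuple


def ffmpeg_normalize_cmd(src: str, dst: str) -> List[str]:
    """Build an FFmpeg command that resamples audio to 16 kHz mono."""
    # -ar sets the sample rate, -ac the channel count, -y overwrites dst.
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]


def transcribe(
    url: str,
    fetch_subtitles: Callable[[str], Optional[str]],  # hypothetical hook
    download_audio: Callable[[str], str],             # hypothetical hook
    run_whisper: Callable[[str], str],                # hypothetical hook
) -> Tuple[str, str]:
    """Return (text, source), where source is 'subtitles' or 'whisper'."""
    subtitles = fetch_subtitles(url)
    if subtitles:
        # Fast path: captions already exist, so no download and no inference.
        return subtitles, "subtitles"
    # Slow path: fetch the audio (assumed already normalized) and transcribe.
    audio_path = download_audio(url)
    return run_whisper(audio_path), "whisper"
```

Passing the three steps in as callables keeps the sketch runnable without network access and makes the ordering easy to check: when `fetch_subtitles` returns text, `download_audio` is never called.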
Key highlights
- Supports 30+ platforms via yt-dlp, plus local uploads (.mp3, .mp4, .txt, etc.)
- Bring-your-own-model: enter any OpenAI-compatible API base URL + key in the UI, click Fetch, and auto-discover available models
- Conditional translation: auto-detects when summary language ≠ source language and adds a Translation tab
- Docker Compose or ./install.sh for setup; runs on Python 3.8+
- Production mode (--prod) keeps SSE connections alive for 30–60+ minute jobs
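Two of the highlights above, model discovery and conditional translation, are simple enough to sketch. The review does not say which endpoint the Fetch button calls; the sketch below assumes the usual OpenAI convention of `GET {base_url}/models` with a bearer token, which returns a JSON object whose `data` array holds entries with an `id` field. All function names here are hypothetical, and the language check assumes both languages arrive as short codes such as `en` or `zh`.

```python
import json
import urllib.request
from typing import List


def parse_model_ids(payload: dict) -> List[str]:
    """Extract model ids from an OpenAI-style /models response body."""
    return sorted(item["id"] for item in payload.get("data", []) if "id" in item)


def fetch_model_ids(base_url: str, api_key: str, timeout: float = 10.0) -> List[str]:
    """GET {base_url}/models with a bearer token and return the model ids."""
    # Assumption: base_url already includes any version prefix such as /v1.
    request = urllib.request.Request(
        base_url.rstrip("/") + "/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_model_ids(json.load(response))


def needs_translation(source_lang: str, summary_lang: str) -> bool:
    """True when the summary language differs from the source language."""
    return source_lang.strip().lower() != summary_lang.strip().lower()
```

A UI built on this would call `fetch_model_ids` when the user clicks Fetch, and show the Translation tab only when `needs_translation` returns `True`.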
Caveats
- Requires FFmpeg and an OpenAI-compatible API key (no local LLM inference out of the box)
- Default Whisper model is base; larger models get slow and memory-hungry fast
- README notes HTTP 500 errors are “usually environment configuration issues”, which suggests rough edges in error handling
Verdict
Good fit if you want a private, self-hosted alternative to cloud transcription services and can tolerate some manual setup. Skip it if you need fully offline LLM inference or enterprise-grade error resilience.