Huanshere/VideoLingo

Your AI subtitle team that actually cares about line breaks

A Streamlit app that downloads, transcribes, translates, and dubs videos while insisting on single-line subtitles like a Netflix QC editor.

Velocity (7d): +26 ★/day · Trend: steady

What it does

VideoLingo is a Python pipeline wrapped in Streamlit that takes a video (often from YouTube via yt-dlp), runs WhisperX for word-level transcription, translates with an LLM, and optionally dubs via TTS. The selling point is obsessive subtitle formatting: NLP segmentation, a three-step “Translate-Reflect-Adapt” loop, and a hard rule for single-line subtitles only. It also supports voice cloning through GPT-SoVITS and other TTS backends.
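To make the three-step loop concrete, here is a minimal sketch of how a Translate-Reflect-Adapt pass over one subtitle line could look. It is not VideoLingo's code: the prompt wording, the function name `translate_reflect_adapt`, and the `llm` callable (any function that takes a prompt string and returns text) are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical prompts; VideoLingo's real prompts are not shown in this review.
TRANSLATE = "Translate this subtitle line into {lang}, staying faithful to the source:\n{text}"
REFLECT = (
    "Source: {text}\nDraft: {draft}\n"
    "List concrete problems with the draft (accuracy, tone, length)."
)
ADAPT = (
    "Source: {text}\nDraft: {draft}\nCritique: {critique}\n"
    "Rewrite the draft as one natural single-line subtitle in {lang}."
)


def translate_reflect_adapt(text: str, lang: str, llm: Callable[[str], str]) -> str:
    """Run three LLM passes over one subtitle line and return the final text."""
    # Pass 1: a faithful first draft.
    draft = llm(TRANSLATE.format(lang=lang, text=text))
    # Pass 2: ask the model to critique its own draft.
    critique = llm(REFLECT.format(text=text, draft=draft))
    # Pass 3: rewrite the draft using the critique.
    final = llm(ADAPT.format(text=text, draft=draft, critique=critique, lang=lang))
    # Enforce the single-line rule regardless of what the model returned:
    # collapse newlines and runs of whitespace into single spaces.
    return " ".join(final.split())
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider, so the same loop could sit in front of a local or a hosted model.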

The interesting bit

Most subtitle tools let the LLM dump text and call it done. VideoLingo treats subtitle structure as a first-class problem: segmentation, terminology glossaries, and reflection steps are all engineered around readable, cinematic captions. The dubbing then tries to match speech rates to those subtitle timings, which is where most “AI dubbing” demos fall apart.
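The speech-rate problem is mostly arithmetic: a synthesized clip has to fit the time slot its subtitle occupies. The sketch below shows one way to compute that, assuming a hypothetical helper `fit_speed` and an arbitrary speed-up cap of 1.4×. Neither the name nor the cap comes from VideoLingo.

```python
from typing import Tuple


def fit_speed(tts_seconds: float, slot_seconds: float,
              max_speed: float = 1.4) -> Tuple[float, bool]:
    """Return (playback speed factor, fits) for a dubbed clip in a subtitle slot.

    The factor is never below 1.0 (short clips are left alone, not slowed)
    and never above max_speed (an assumed limit for natural-sounding speech).
    """
    if tts_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")
    needed = tts_seconds / slot_seconds       # speed-up required to fit exactly
    speed = min(max(needed, 1.0), max_speed)  # clamp into [1.0, max_speed]
    fits = needed <= max_speed                # False: the text itself must be shortened
    return speed, fits
```

When `fits` comes back false, speeding up the audio is not enough and the translated line would need to be shortened, which is one reason the translation and dubbing stages cannot be treated as independent.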

Key highlights

  • Word-level WhisperX alignment with wav2vec, plus a voice-separation toggle for noisy sources
  • Custom or AI-generated terminology lists to keep translations consistent
  • TTS options span Azure, OpenAI, Fish TTS, GPT-SoVITS, Edge-TTS, and a plug-in custom_tts.py
  • One-click setup via uv (Python 3.10 auto-fetched); Docker with CUDA 12.4 also available
  • Streamlit UI with pause/resume/stop controls and progress logging
  • Can run fully local (Ollama + Edge-TTS) or via 302.ai for cloud LLM/Whisper/TTS

Caveats

  • WhisperX’s wav2vec alignment truncates subtitles ending in numbers or special characters, since it can’t map “1” to spoken “one”
  • Weaker LLMs may choke on strict JSON response requirements; the author notes you must delete the output folder and retry, or cached bad responses will loop forever
  • Multilingual videos keep only the dominant language; multiple speakers can’t be dubbed separately because speaker diarization isn’t reliable enough
  • Dubbing quality varies with speech-rate differences across languages
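The cached-bad-response caveat is a general hazard of caching LLM output. As an illustration only, the sketch below validates a response before caching it and discards a cache entry that fails validation. The function `ask_json`, its parameters, and the one-file-per-response cache layout are assumptions, not VideoLingo's implementation, which according to the author requires deleting the output folder by hand.

```python
import json
from pathlib import Path
from typing import Callable, Set


def ask_json(prompt: str, llm: Callable[[str], str], cache_file: Path,
             required_keys: Set[str], retries: int = 3) -> dict:
    """Return a validated JSON object, caching only responses that validate."""
    if cache_file.exists():
        try:
            cached = json.loads(cache_file.read_text(encoding="utf-8"))
            if isinstance(cached, dict) and required_keys.issubset(cached):
                return cached
        except json.JSONDecodeError:
            pass
        # Malformed or incomplete entry: drop it instead of reusing it forever.
        cache_file.unlink()

    for _ in range(retries):
        raw = llm(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # weaker models may wrap JSON in prose; just ask again
        if isinstance(data, dict) and required_keys.issubset(data):
            cache_file.write_text(json.dumps(data, ensure_ascii=False),
                                  encoding="utf-8")
            return data

    raise ValueError(f"no valid JSON after {retries} attempts")
```

Because nothing is written to disk until a response both parses and contains the required keys, a bad reply costs one retry rather than poisoning every later run.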

Verdict

Worth a look if you produce localized video content and are tired of fixing broken line wraps by hand. Skip it if you need reliable multilingual source handling or broadcast-grade speaker separation out of the box.
