← all repositories
bravekingzhang/text2video

Turn a novel into a slideshow with a voiceover

A Chinese developer's weekend project that splits text by periods, generates images per sentence, and syncs them to TTS audio.

text2video
Velocity · 7d
+1.0
★ / day
Trend
steady
star history

What it does

Feed it a block of text and it spits out an MP4. The pipeline is deliberately crude: split on Chinese periods, generate one image per sentence via Stable Diffusion (through Hugging Face or pollinations.ai), synthesize speech with edge-tts, then let the audio duration dictate how long each static image lingers. Subtitles are burned in with OpenCV. The author calls it “half a magic tool” — the description is accurate.

The interesting bit

The project leans on an LLM (OpenAI-compatible API, Moonshot demo included) to translate Chinese into English and juice up Midjourney-style prompts, because open-source image models still stumble on Chinese text. It’s a pragmatic admission of weakness rather than a polished workaround.

Key highlights

  • Docker Compose one-liner for setup, though local dev is macOS + Python 3.10.12 only
  • Uses edge-tts (free) for narration and ffmpeg for final muxing
  • Optional pollinations.ai path skips Hugging Face tokens entirely by using DALL·E 2
  • Web UI runs on localhost:5001
  • MIT licensed, with a WeChat QR code for moral support

Caveats

  • “Other environments may have compatibility issues” — the author’s words, not hedging
  • Image quality depends heavily on whether you feed it an OpenAI API key for prompt enhancement
  • Sentence splitting by punctuation alone will mangle dialogue and pacing

Verdict

Worth a spin if you want to prototype “visual novels” or automated storyboards without touching a timeline editor. Skip it if you need camera motion, scene continuity, or anything resembling professional video editing.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.