Turn a novel into a slideshow with a voiceover
A Chinese developer's weekend project that splits text by periods, generates images per sentence, and syncs them to TTS audio.

What it does
Feed it a block of text and it spits out an MP4. The pipeline is deliberately crude: split on Chinese periods, generate one image per sentence via Stable Diffusion (through Hugging Face or pollinations.ai), synthesize speech with edge-tts, then let the audio duration dictate how long each static image lingers. Subtitles are burned in with OpenCV. The author calls it “half a magic tool” — the description is accurate.
The interesting bit
The project leans on an LLM (OpenAI-compatible API, Moonshot demo included) to translate Chinese into English and juice up Midjourney-style prompts, because open-source image models still stumble on Chinese text. It’s a pragmatic admission of weakness rather than a polished workaround.
Key highlights
- Docker Compose one-liner for setup, though local dev is macOS + Python 3.10.12 only
- Uses edge-tts (free) for narration and ffmpeg for final muxing
- Optional pollinations.ai path skips Hugging Face tokens entirely by using DALL·E 2
- Web UI runs on localhost:5001
- MIT licensed, with a WeChat QR code for moral support
Caveats
- “Other environments may have compatibility issues” — the author’s words, not hedging
- Image quality depends heavily on whether you feed it an OpenAI API key for prompt enhancement
- Sentence splitting by punctuation alone will mangle dialogue and pacing
Verdict
Worth a spin if you want to prototype “visual novels” or automated storyboards without touching a timeline editor. Skip it if you need camera motion, scene continuity, or anything resembling professional video editing.