Aratako/Irodori-TTS

Emoji as a steering wheel for synthetic voices

A Japanese TTS system that lets you control speaking style by sprinkling emoji into the input text.

★ 1k stars · Python · Image · Video · Audio

What it does

Irodori-TTS generates speech from text using a flow-matching diffusion transformer, with a twist: certain checkpoints let you drop emoji into the input string to nudge delivery style and non-verbal expression. It also supports zero-shot voice cloning from reference audio, and a “VoiceDesign” mode where you describe a speaker in natural language (“calm female voice, close distance, soft and natural”) rather than finding a matching sample.
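
To make "flow-matching" concrete, here is a minimal sketch of the sampling loop such models run at inference time: start from Gaussian noise and integrate a velocity field from t=0 to t=1 with fixed Euler steps. Everything here is an illustrative assumption rather than Irodori-TTS code: `toy_velocity` stands in for the trained transformer, and the three-number `target` stands in for an audio latent.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for a trained velocity network (assumption, not Irodori-TTS code).

    On a straight-line path x_t = (1 - t) * noise + t * target, the velocity
    that carries x_t to the target by t=1 is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(target, num_steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # initial Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0 inside the loop, so no division by zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


latent = euler_sample([0.5, -1.0, 2.0])
print(latent)
```

In a real system the velocity would come from the network conditioned on text (and, where supported, reference audio or a caption), and the final latent would be decoded to a waveform by the codec.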

The interesting bit

The emoji control is the memorable hook, but the real architectural bet is on multi-branch conditioning. The VoiceDesign model can simultaneously attend to three separate signals: the text to speak, a reference audio clip for speaker identity, and a caption describing emotional tone or speaking style. The base model and VoiceDesign variants share a DiT backbone with Low-Rank AdaLN and half-RoPE, trained on continuous DACVAE latents rather than raw waveforms or discrete tokens.
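
The summary does not spell out what "Low-Rank AdaLN" means, so the following is only one plausible reading, sketched in plain Python: adaptive layer norm predicts a per-channel scale and shift from a conditioning vector, and the low-rank variant routes that prediction through a small bottleneck instead of one large projection. The function names, matrix shapes, and the `(1 + scale)` convention are assumptions for illustration, not taken from the repository.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matvec(m, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * vi for w, vi in zip(row, v)) for row in m]


def low_rank_adaln(x, cond, down, up_scale, up_shift):
    """Hypothetical low-rank AdaLN.

    down:     rank x cond_dim  (compresses the conditioning vector)
    up_scale: dim  x rank      (expands to per-channel scale)
    up_shift: dim  x rank      (expands to per-channel shift)
    """
    z = matvec(down, cond)
    scale = matvec(up_scale, z)
    shift = matvec(up_shift, z)
    return [h * (1.0 + s) + b for h, s, b in zip(layer_norm(x), scale, shift)]


x = [1.0, 2.0, 3.0, 4.0]                        # dim = 4
cond = [2.0, 3.0, 5.0]                          # cond_dim = 3
down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]       # rank = 2
zeros = [[0.0, 0.0] for _ in x]
shift_only = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
print(low_rank_adaln(x, cond, down, zeros, shift_only))
```

With both up-projections set to zero the block reduces to a plain layer norm, which is why this kind of modulation is often initialized that way.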

Key highlights

Flow-matching diffusion transformer (RF-DiT) with automatic duration prediction in v3 checkpoints
Three-branch conditioning: text + reference speech + style caption in VoiceDesign mode
Speaker Inversion: learn compact embedding tokens for a target voice while freezing the base model
PEFT/LoRA fine-tuning supported for adapting released checkpoints without full retraining
SilentCipher audio watermarking applied when the library is available
v3 codebase backward-compatible with v2 checkpoints; v1 checkpoints are not
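
The Speaker Inversion idea in the list above (optimize a small embedding while the base model stays frozen) can be illustrated with a toy example. This is a sketch under heavy assumptions: a fixed 2x2 matrix stands in for the frozen model, a two-number vector stands in for the target voice's features, and plain gradient descent replaces whatever optimizer and loss the project actually uses.

```python
def matvec(m, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * vi for w, vi in zip(row, v)) for row in m]


def loss_and_grad(frozen_w, emb, target):
    """Squared error of frozen_w @ emb against target, and its gradient w.r.t. emb."""
    residual = [p - t for p, t in zip(matvec(frozen_w, emb), target)]
    loss = sum(r * r for r in residual)
    # d(loss)/d(emb) = 2 * W^T @ residual; the weights themselves get no update.
    grad = [
        2.0 * sum(frozen_w[i][j] * residual[i] for i in range(len(residual)))
        for j in range(len(emb))
    ]
    return loss, grad


def invert_speaker(frozen_w, target, steps=200, lr=0.1):
    """Learn only the embedding; frozen_w is read but never modified."""
    emb = [0.0] * len(frozen_w[0])
    for _ in range(steps):
        _, grad = loss_and_grad(frozen_w, emb, target)
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb


BASE_W = [[1.0, 0.5], [0.0, 1.0]]      # hypothetical frozen "model"
TARGET = [1.0, -1.0]                   # hypothetical target-voice features
speaker_emb = invert_speaker(BASE_W, TARGET)
print(speaker_emb)
```

The design appeal is that each new voice costs only a few learned tokens, and the shared checkpoint stays untouched.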

Caveats

Emoji-driven style control only works in “supported checkpoints” — the README doesn’t specify which ones beyond implying v3
The released codec is specifically Semantic-DACVAE-Japanese-32dim; non-Japanese quality is unclear
v1 preprocessing and checkpoints are incompatible with v2/v3; upgrading means retraining from scratch

Verdict

Worth a look if you’re building Japanese voice applications and want fine-grained style control without collecting hours of labeled emotional speech. Skip if you need proven multilingual support or a stable, frozen API — the version churn (v1→v2→v3 in rapid succession) suggests the project is still finding its footing.

Frequently asked

What is Aratako/Irodori-TTS? A Japanese TTS system that lets you control speaking style by sprinkling emoji into the input text.
Is Irodori-TTS open source? Yes. Aratako/Irodori-TTS is open source, released under the MIT license.
What language is Irodori-TTS written in? Aratako/Irodori-TTS is primarily written in Python.
How popular is Irodori-TTS? Aratako/Irodori-TTS has 1k stars on GitHub.
Where can I find Irodori-TTS? Aratako/Irodori-TTS is on GitHub at https://github.com/Aratako/Irodori-TTS.