Suno's open-source voice box that sometimes sings
A research-grade text-to-audio model that treats speech, music, and sound effects as equally valid outputs.

What it does
Bark is a transformer-based text-to-audio model from Suno that generates speech, music, background noise, and simple sound effects from raw text prompts. It skips phonemes entirely, mapping text straight to audio, so it can handle multilingual speech, code-switching between languages, lyrics wrapped in ♪ symbols, and nonverbal cues like [laughs] or [sighs]. Pretrained checkpoints are MIT-licensed and ready for inference.
The interesting bit
The model doesn’t distinguish between speech and music; it just generates audio. Your prompt about quarterly earnings might come back as a torch song if you’re unlucky, but the same property lets you coax out sound effects and emotional prosody that conventional TTS systems don’t produce.
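The prompt conventions described above can be sketched in a few lines of Python. The names `generate_audio`, `preload_models`, and `SAMPLE_RATE` come from the `bark` package; the helpers `as_lyrics`, `with_cue`, and `synthesize` are hypothetical, introduced here only for illustration. Treat this as a minimal sketch: the first run downloads model checkpoints, and the audio can differ from run to run.

```python
def as_lyrics(text: str) -> str:
    """Wrap text in music notes, the cue for sung output."""
    return f"♪ {text.strip()} ♪"


def with_cue(text: str, cue: str) -> str:
    """Append a nonverbal cue such as 'laughs' or 'sighs' in square brackets."""
    return f"{text.strip()} [{cue}]"


def synthesize(prompt: str, out_path: str = "bark_out.wav") -> str:
    """Generate audio for a prompt and write it to a WAV file.

    Imports are local so the prompt helpers above work without Bark installed.
    """
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads and caches checkpoints on first use
    audio = generate_audio(prompt)  # NumPy array of audio samples
    write_wav(out_path, SAMPLE_RATE, audio)
    return out_path


if __name__ == "__main__":
    spoken = with_cue("Quarterly earnings were up this year.", "laughs")
    sung = as_lyrics("Quarterly earnings were up this year")
    print(spoken)
    print(sung)
    # Uncomment to run the model (needs the bark package and a large download):
    # synthesize(spoken)
```

Because the model treats every prompt as generic audio, neither cue guarantees the result: the ♪ wrapper makes singing more likely, not certain.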
Key highlights
- 100+ speaker presets across multiple languages, with automatic language detection from input text
- Direct text-to-audio generation (no phoneme intermediate) using an EnCodec-quantized representation
- Supports non-speech tags: [laughter], [music], ... for hesitations, CAPITALIZATION for emphasis
- Full model needs ~12GB VRAM; smaller variant fits in 8GB via SUNO_USE_SMALL_MODELS=True
- Also available through Hugging Face Transformers (4.31.0+) for reduced dependency friction
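The memory and preset options in the list above might be wired together roughly as follows. `SUNO_USE_SMALL_MODELS`, the `history_prompt` argument, and the Transformers classes `AutoProcessor` and `BarkModel` are real. The helper names, the 12GB threshold check (taken from the figure quoted above), and the choice of the `v2/en_speaker_6` preset and `suno/bark-small` checkpoint are assumptions for illustration.

```python
import os


def needs_small_models(vram_gb: float, full_model_gb: float = 12.0) -> bool:
    """Return True when the GPU has less memory than the full model's ~12GB."""
    return vram_gb < full_model_gb


def configure_bark(vram_gb: float) -> None:
    """Set the small-model flag when needed; call this BEFORE importing bark."""
    if needs_small_models(vram_gb):
        os.environ["SUNO_USE_SMALL_MODELS"] = "True"


def speak_with_bark(text: str, preset: str = "v2/en_speaker_6", vram_gb: float = 8.0):
    """Generate speech with the bark package using a speaker preset."""
    configure_bark(vram_gb)
    # Imported after configuration so the environment flag is picked up.
    from bark import SAMPLE_RATE, generate_audio, preload_models

    preload_models()
    audio = generate_audio(text, history_prompt=preset)
    return audio, SAMPLE_RATE


def speak_with_transformers(
    text: str,
    preset: str = "v2/en_speaker_6",
    checkpoint: str = "suno/bark-small",
):
    """Generate speech through Hugging Face Transformers (4.31.0 or later)."""
    from transformers import AutoProcessor, BarkModel

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = BarkModel.from_pretrained(checkpoint)
    inputs = processor(text, voice_preset=preset)
    audio = model.generate(**inputs).cpu().numpy().squeeze()
    return audio, model.generation_config.sample_rate


if __name__ == "__main__":
    print(needs_small_models(8.0))   # an 8GB card is below the 12GB threshold
    print(needs_small_models(24.0))  # a 24GB card is not
```

Both paths return the samples together with their sample rate, so the caller can write a WAV file or play the audio without hard-coding the rate.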
Caveats
- Not a conventional TTS system: outputs can “deviate in unexpected ways from provided prompts,” per Suno’s own disclaimer
- English quality is best; other languages are expected to improve with scaling
- No custom voice cloning—only preset speakers
- Roughly real-time generation needs enterprise GPUs and PyTorch nightly; CPUs and older GPUs are significantly slower
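To see where a given machine falls relative to real time, one option is to compare generation time with the duration of the audio produced. The sketch below is an assumption-laden illustration, not part of Bark: `real_time_factor` and `timed` are hypothetical helpers, and the stand-in generator only mimics a model call so the example runs without a GPU.

```python
import time


def real_time_factor(generation_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Ratio of compute time to audio duration; 1.0 or less means real time or faster."""
    audio_seconds = num_samples / sample_rate
    return generation_seconds / audio_seconds


def timed(generate, prompt):
    """Call generate(prompt) and return (audio, elapsed_seconds)."""
    start = time.perf_counter()
    audio = generate(prompt)
    return audio, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in for a real model call: three seconds of silence at 24 kHz.
    def fake_generate(prompt):
        return [0.0] * 72000

    audio, elapsed = timed(fake_generate, "Hello")
    print(real_time_factor(elapsed, len(audio), 24000))
```

With a real model, pass a function that wraps the actual generation call and use the sample rate the library reports.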
Verdict
Worth exploring if you need expressive, generative audio for prototypes or creative tools. Skip it if you need predictable, production-grade TTS with guaranteed fidelity.