suno-ai/bark

Suno's open-source voice box that sometimes sings

A research-grade text-to-audio model that treats speech, music, and sound effects as equally valid outputs.

39.1k stars Jupyter Notebook Image · Video · Audio
Velocity · 7d: +34 ★ / day · Trend: steady

What it does

Bark is a transformer-based text-to-audio model from Suno that generates speech, music, background noise, and simple sound effects from raw text prompts. It skips phonemes entirely and maps text straight to audio, so it can handle multilingual speech, code-switched text spoken with matching accents, lyrics wrapped in ♪ symbols, and nonverbal cues like [laughs] or [sighs]. Pretrained checkpoints are MIT-licensed and ready for inference.
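A minimal sketch of the text-in, audio-out flow described above, assuming the `bark` package and `scipy` are installed. The names `generate_audio`, `preload_models`, and `SAMPLE_RATE` follow the usage shown in the Bark README; `build_prompt` and `synthesize` are hypothetical helpers introduced only for illustration.

```python
def build_prompt(text, cues=(), sung=False):
    """Assemble a Bark prompt string (hypothetical helper, not part of Bark).

    cues: nonverbal tags such as "laughs" or "sighs", emitted as [laughs].
    sung: wrap the text in ♪ symbols to nudge the model toward singing.
    """
    body = f"♪ {text} ♪" if sung else text
    tags = " ".join(f"[{cue}]" for cue in cues)
    return f"{body} {tags}".strip()


def synthesize(prompt, out_path="bark_generation.wav"):
    """Generate audio for `prompt` and write it to a WAV file."""
    # Imported lazily so the prompt helper works without Bark installed.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # fetches and caches the checkpoints on first use
    audio_array = generate_audio(prompt)  # NumPy array of audio samples
    write_wav(out_path, SAMPLE_RATE, audio_array)
    return out_path
```

Calling `synthesize(build_prompt("Hello, my name is Suno.", cues=["laughs"]))` would write one clip to disk. Because the model is fully generative, the same prompt may produce noticeably different audio from run to run.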

The interesting bit

The model doesn’t actually distinguish between speech and music; it just generates audio. This means your prompt about quarterly earnings might come back as a torch song if you’re unlucky, but it also means you can coax out sound effects and emotional prosody that conventional TTS systems do not produce.

Key highlights

  • 100+ speaker presets across multiple languages, with automatic language detection from input text
  • Direct text-to-audio generation (no phoneme intermediate) using an EnCodec-quantized representation
  • Supports non-speech tags such as [laughter] and [music], plus "..." for hesitations and CAPITALIZATION for emphasis
  • Full model needs ~12GB VRAM; smaller variant fits in 8GB via SUNO_USE_SMALL_MODELS=True
  • Also available through Hugging Face Transformers (4.31.0+) for reduced dependency friction
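The configuration points in the list above can be sketched as follows. This is an illustration under stated assumptions, not Suno's reference setup: `configure_bark`, `speak`, and `speak_with_transformers` are hypothetical helpers, the 12GB threshold is simply the figure quoted above, the preset name `v2/en_speaker_6` follows the naming used in the Bark README, and the Transformers calls follow the Hugging Face Bark documentation with the `suno/bark` checkpoint.

```python
import os


def configure_bark(vram_gb):
    """Pick full or small checkpoints from available VRAM (hypothetical helper).

    Call this before `import bark`, assuming the flag is read at import time.
    """
    use_small = vram_gb < 12  # ~12GB is the figure quoted for the full model
    if use_small:
        os.environ["SUNO_USE_SMALL_MODELS"] = "True"
    else:
        os.environ.pop("SUNO_USE_SMALL_MODELS", None)
    return use_small


def speak(text, speaker="v2/en_speaker_6"):
    """Generate speech with a preset speaker via the bark package."""
    from bark import generate_audio  # lazy import: needs bark installed

    return generate_audio(text, history_prompt=speaker)


def speak_with_transformers(text, speaker="v2/en_speaker_6"):
    """Alternative path through Hugging Face Transformers (4.31.0+)."""
    from transformers import AutoProcessor, BarkModel

    processor = AutoProcessor.from_pretrained("suno/bark")
    model = BarkModel.from_pretrained("suno/bark")
    inputs = processor(text, voice_preset=speaker)
    audio = model.generate(**inputs)
    # Return the waveform as a NumPy array together with its sample rate.
    return audio.cpu().numpy().squeeze(), model.generation_config.sample_rate
```

On an 8GB card, `configure_bark(8)` sets the flag so that a later `import bark` loads the smaller variant; on a larger card the flag is cleared and the full model is used.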

Caveats

  • Not a conventional TTS system: outputs can “deviate in unexpected ways from provided prompts,” per Suno’s own disclaimer
  • English quality is best; other languages are expected to improve with scaling
  • No custom voice cloning; only the preset speakers are available
  • Roughly real-time generation requires enterprise GPUs and PyTorch nightly; CPUs and older GPUs are significantly slower
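To check where a given machine falls relative to the real-time caveat above, one option is to measure a real-time factor: generation time divided by the duration of the audio produced. The sketch below is an assumption-laden illustration; `real_time_factor` and `timed_generation` are hypothetical helpers, and only `generate_audio` and `SAMPLE_RATE` come from the Bark README.

```python
import time


def real_time_factor(elapsed_s, n_samples, sample_rate):
    """Return generation time divided by audio duration.

    Values at or below 1.0 mean the clip was produced at least as fast
    as it plays back; larger values mean slower than real time.
    """
    duration_s = n_samples / sample_rate
    return elapsed_s / duration_s


def timed_generation(prompt):
    """Generate one clip and report its real-time factor."""
    from bark import SAMPLE_RATE, generate_audio  # lazy import: needs bark

    start = time.perf_counter()
    audio = generate_audio(prompt)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(audio), SAMPLE_RATE)
```

For example, a clip of 240,000 samples at 24,000 Hz lasts 10 seconds, so generating it in 5 seconds gives a factor of 0.5, while 20 seconds gives 2.0.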

Verdict

Worth exploring if you need expressive, generative audio for prototypes or creative tools. Skip it if you need predictable, production-grade TTS with guaranteed fidelity.
