zhenye234/LLaSA_training
A speech synthesis model built on the LLaMA architecture that generates audio from text, scaling both train-time and inference-time compute.

LLaSA is a LLaMA-based neural text-to-speech system that generates natural speech from text. It scales compute at both training and inference time to improve output quality. Audio is encoded with the XCodec2 codec, and text is encoded with a Llama tokenizer (e.g., Llama-3.2-1B-Instruct). Training can be launched in distributed mode with torchrun or SLURM, and the project provides 160k hours of open-source tokenized speech data for training.
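
Because training can be started by either torchrun or SLURM, a training script usually needs to work out its process rank from whichever launcher is in use. The following is a minimal sketch of that step, not code from this repository: the names `DistInfo` and `resolve_dist_info` are hypothetical. It relies only on the environment variables the launchers themselves export (`RANK`, `LOCAL_RANK`, `WORLD_SIZE` for torchrun; `SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NTASKS` for SLURM).

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistInfo(NamedTuple):
    """Hypothetical container for the process layout (not part of LLaSA)."""
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node
    world_size: int  # total number of processes
    launcher: str    # "torchrun", "slurm", or "single"


def resolve_dist_info(env: Optional[Mapping[str, str]] = None) -> DistInfo:
    """Read rank information from torchrun or SLURM environment variables."""
    env = os.environ if env is None else env

    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for every worker.
    if "RANK" in env and "WORLD_SIZE" in env:
        return DistInfo(
            rank=int(env["RANK"]),
            local_rank=int(env.get("LOCAL_RANK", 0)),
            world_size=int(env["WORLD_SIZE"]),
            launcher="torchrun",
        )

    # Under srun, SLURM exports one SLURM_PROCID per task.
    if "SLURM_PROCID" in env and "SLURM_NTASKS" in env:
        return DistInfo(
            rank=int(env["SLURM_PROCID"]),
            local_rank=int(env.get("SLURM_LOCALID", 0)),
            world_size=int(env["SLURM_NTASKS"]),
            launcher="slurm",
        )

    # Neither launcher detected: assume a single local process.
    return DistInfo(rank=0, local_rank=0, world_size=1, launcher="single")


if __name__ == "__main__":
    print(resolve_dist_info())
```

Checking the torchrun variables first is a design choice for this sketch: when torchrun is started inside a SLURM allocation, both sets of variables are present, and the torchrun values are the ones that describe the individual worker processes.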