YuanGongND/ltu
An audio and speech large language model that bridges audio and speech perception with natural language understanding.

LTU and LTU-AS are audio and speech large language models that take audio as input and answer open-ended questions about it, while also performing strongly on closed-ended audio tasks. The repository provides PyTorch implementations with pretrained checkpoints, the OpenAQA and OpenASQA datasets, code to reproduce training, and fine-tuning support. Interactive HuggingFace Space demos let users try the models without local GPU resources.