Kaldi's Python cousin wrangles speech, video, and text
A data-prep library that treats audio snippets like editable clips, now stretching into multimodal territory.

What it does Lhotse is a Python toolkit for preparing multimodal training data—speech, audio, video, image, and text—for machine learning pipelines. It provides standard recipes for common corpora, represents metadata in human-readable JSON/YAML manifests, and feeds PyTorch through task-specific Dataset classes. The core abstraction is the “cut”: a slice of audio or video that you can mix, truncate, pad, and augment on-the-fly without pre-baking everything to disk.
The interesting bit The cut abstraction lets you manipulate training samples as lazy, composable objects rather than concrete files. Feature extraction and augmentation can run pre-computed (with optional lilcom compression) or on-demand, and the library supports “feature-space cut mixing”—blending already-computed features rather than raw waveforms. For storage, Lhotse Shar offers a WebDataset-like sequential format optimized for streaming I/O.
Key highlights
- Born from the Kaldi speech-processing lineage, paired with the k2 finite-state automata library
- Dataset blending and on-the-fly bucketing for multi-corpus training
- Built-in deduplication and randomization for distributed multi-node setups
- Colab tutorials covering workflows, WebDataset integration, and image/video loading
- Extensible backend system for audio (torchaudio, soundfile, torchcodec), I/O, and resampling
Caveats
- The README warns that forcing MSCIOBackend for all URLs “may break functionality”
- torchaudio is now optional; disabling it strips “many functionalities” though basics remain
- Python 3.7+ supported, but some newer backends (e.g., torchcodec) require recent PyTorch versions
Verdict Worth a look if you’re building speech or multimodal models and want Kaldi’s corpus recipes without the C++ plumbing. Less compelling if your data pipeline is already humming in pure PyTorch or another framework.