lhotse-speech/lhotse

Kaldi's Python cousin wrangles speech, video, and text

A data-prep library that treats audio snippets like editable clips, now stretching into multimodal territory.

1.1k stars · Python · Data Tooling
Velocity · 7d: +0.5 ★ / day · Trend: steady

What it does

Lhotse is a Python toolkit for preparing multimodal training data (speech, audio, video, image, and text) for machine learning pipelines. It provides standard recipes for common corpora, represents metadata in human-readable JSON/YAML manifests, and feeds PyTorch through task-specific Dataset classes. The core abstraction is the “cut”: a slice of audio or video that you can mix, truncate, pad, and augment on the fly without pre-baking everything to disk.
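To make the "cut" idea concrete, here is a minimal pure-Python sketch of a lazy slice. `ToyCut` and its methods are hypothetical names invented for this illustration and are not Lhotse's classes or signatures. The sketch only shows the general pattern: edits produce new metadata, and audio is touched only when it is finally loaded.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class ToyCut:
    """Hypothetical stand-in for a cut: metadata only, no audio inside."""
    recording_id: str
    start: float           # offset into the recording, in seconds
    duration: float        # length of real audio, in seconds
    pad_to: float = 0.0    # target length after right-padding (0 = none)

    def truncate(self, offset: float, duration: float) -> "ToyCut":
        # Returns a new description of a sub-span; nothing is read or written.
        if offset < 0 or duration <= 0 or offset + duration > self.duration:
            raise ValueError("requested span falls outside the cut")
        return replace(self, start=self.start + offset, duration=duration, pad_to=0.0)

    def pad(self, duration: float) -> "ToyCut":
        # Records the padded length; the silence is only created in load().
        return replace(self, pad_to=max(duration, self.duration))

    def load(self, samples: List[float], sampling_rate: int) -> List[float]:
        # Audio is materialized here, at the last possible moment.
        begin = round(self.start * sampling_rate)
        end = begin + round(self.duration * sampling_rate)
        audio = samples[begin:end]
        total = round(max(self.duration, self.pad_to) * sampling_rate)
        return audio + [0.0] * (total - len(audio))


recording = [float(i) for i in range(80)]           # 10 s of fake audio at 8 Hz
cut = ToyCut("rec-1", start=0.0, duration=10.0)
clip = cut.truncate(offset=2.0, duration=3.0).pad(duration=5.0)
audio = clip.load(recording, sampling_rate=8)       # 3 s of signal + 2 s of zeros
```

Because each operation returns a new immutable description, the original `cut` is unchanged and many variants of the same recording can coexist without extra copies on disk.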

The interesting bit

The cut abstraction lets you manipulate training samples as lazy, composable objects rather than concrete files. Feature extraction and augmentation can be done ahead of time, with the results stored on disk (optionally compressed with lilcom), or on demand during training. The library also supports “feature-space cut mixing”: blending already-computed features rather than raw waveforms. For storage, Lhotse Shar offers a WebDataset-like sequential format optimized for streaming I/O.
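As a rough illustration of what mixing in feature space can mean, the sketch below blends two feature matrices under the assumption that they hold natural-log energies with identical shapes. The function names, the SNR convention, and the gain computation are assumptions made for this example; Lhotse's own implementation may differ in detail.

```python
import math
from typing import List

Frames = List[List[float]]  # frames x bins, natural-log energies (assumed)


def total_energy(features: Frames) -> float:
    # Sum of linear-domain energies over all frames and bins.
    return sum(math.exp(v) for frame in features for v in frame)


def mix_log_features(reference: Frames, other: Frames, snr_db: float) -> Frames:
    # Scale `other` so the reference is `snr_db` decibels louder overall.
    target_energy = total_energy(reference) / (10 ** (snr_db / 10))
    gain = target_energy / total_energy(other)
    # Energies add linearly, so leave the log domain, sum, and go back.
    return [
        [math.log(math.exp(a) + gain * math.exp(b)) for a, b in zip(ref_frame, oth_frame)]
        for ref_frame, oth_frame in zip(reference, other)
    ]


speech = [[0.0, 0.0]]   # one frame, two bins, energy 1.0 per bin
noise = [[0.0, 0.0]]
mixed = mix_log_features(speech, noise, snr_db=10.0)
```

The appeal of this approach is that noisy variants can be produced from features that were already extracted, without going back to the waveforms.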

Key highlights

  • Born from the Kaldi speech-processing lineage, paired with the k2 finite-state automata library
  • Dataset blending and on-the-fly bucketing for multi-corpus training
  • Built-in deduplication and randomization for distributed multi-node setups
  • Colab tutorials covering workflows, WebDataset integration, and image/video loading
  • Extensible backend system for audio (torchaudio, soundfile, torchcodec), I/O, and resampling
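The on-the-fly bucketing mentioned in the highlights can be pictured with a small streaming sketch: items of similar duration are grouped, and a batch is emitted once adding another item would exceed a duration budget. `bucketed_batches`, its parameters, and the fixed boundaries are hypothetical and chosen for illustration; they do not reproduce Lhotse's sampler, which handles details such as shuffling that are omitted here.

```python
from bisect import bisect_right
from typing import Dict, Iterable, Iterator, List, Sequence


def bucketed_batches(
    durations: Iterable[float],
    boundaries: Sequence[float],
    max_duration: float,
) -> Iterator[List[float]]:
    # One open batch per bucket; buckets are defined by sorted duration boundaries.
    buckets: Dict[int, List[float]] = {}
    for d in durations:
        idx = bisect_right(boundaries, d)          # which bucket this item falls in
        bucket = buckets.setdefault(idx, [])
        if bucket and sum(bucket) + d > max_duration:
            yield list(bucket)                     # budget exceeded: emit and reset
            bucket.clear()
        bucket.append(d)
    # Flush whatever is left once the stream ends.
    for idx in sorted(buckets):
        if buckets[idx]:
            yield list(buckets[idx])


stream = [2.0, 9.0, 3.0, 8.0, 2.5, 9.5, 1.5]       # item durations in seconds
batches = list(bucketed_batches(stream, boundaries=[5.0], max_duration=10.0))
# batches == [[9.0], [8.0], [2.0, 3.0, 2.5, 1.5], [9.5]]
```

Grouping by duration keeps padding within a batch small, and working on a stream means the full dataset never has to be sorted up front.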

Caveats

  • The README warns that forcing MSCIOBackend for all URLs “may break functionality”
  • torchaudio is now optional; disabling it strips “many functionalities” though basics remain
  • Python 3.7+ supported, but some newer backends (e.g., torchcodec) require recent PyTorch versions

Verdict

Worth a look if you’re building speech or multimodal models and want Kaldi’s corpus recipes without the C++ plumbing. Less compelling if your data pipeline is already humming in pure PyTorch or another framework.
