Natooz/MidiTok

The Boring Plumbing Behind Music-Generating Transformers

MidiTok exists because no one should have to reimplement ten different academic tokenization papers just to feed a MIDI file into a transformer.

What it does

MidiTok converts MIDI and ABC files into token sequences for deep learning models. It wraps ten published tokenization schemes (REMI, CPWord, Octuple, and others) behind one consistent API with shared configuration. The library also handles the tedious adjacent tasks: BPE, Unigram, and WordPiece vocabulary training via Hugging Face tokenizers, data augmentation, and ready-made PyTorch dataset helpers.

The interesting bit

The real value isn’t any single tokenization method; it’s the abstraction layer that stops you from rewriting the same MIDI preprocessing boilerplate every time a new paper introduces a new vocabulary. Because the tokenizers are interchangeable modules behind one interface, switching tokenization schemes becomes a config change.
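
To make the "config change" idea concrete, here is a minimal toy sketch in plain Python. It does not use MidiTok's API: the `Note` class, the `remi_like` and `event_stream` functions, the `TOKENIZERS` registry, and the `TICKS_PER_BAR` value are all hypothetical names invented for illustration. The first scheme loosely follows the Bar / Position / Pitch / Velocity / Duration layout associated with REMI; the second is a simple note-on / note-off event stream.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption for this toy: 4/4 time quantised to 16 ticks per bar.
TICKS_PER_BAR = 16


@dataclass(frozen=True)
class Note:
    start: int     # onset, in ticks
    duration: int  # length, in ticks
    pitch: int     # MIDI pitch, 0-127
    velocity: int  # MIDI velocity, 1-127


def remi_like(notes: list[Note]) -> list[str]:
    """Bar / Position / Pitch / Velocity / Duration tokens (REMI-style toy)."""
    tokens: list[str] = []
    current_bar = -1
    for note in sorted(notes, key=lambda n: (n.start, n.pitch)):
        bar, position = divmod(note.start, TICKS_PER_BAR)
        # Emit one Bar token per bar boundary crossed, including empty bars.
        while current_bar < bar:
            tokens.append("Bar")
            current_bar += 1
        tokens += [
            f"Position_{position}",
            f"Pitch_{note.pitch}",
            f"Velocity_{note.velocity}",
            f"Duration_{note.duration}",
        ]
    return tokens


def event_stream(notes: list[Note]) -> list[str]:
    """NoteOn / NoteOff / TimeShift tokens (event-stream toy)."""
    events = []
    for n in notes:
        events.append((n.start, 1, n.pitch, n.velocity))              # note-on
        events.append((n.start + n.duration, 0, n.pitch, n.velocity))  # note-off
    events.sort()  # at equal ticks, note-offs (0) sort before note-ons (1)
    tokens: list[str] = []
    clock = 0
    for time, kind, pitch, velocity in events:
        if time > clock:
            tokens.append(f"TimeShift_{time - clock}")
            clock = time
        if kind == 1:
            tokens += [f"Velocity_{velocity}", f"NoteOn_{pitch}"]
        else:
            tokens.append(f"NoteOff_{pitch}")
    return tokens


# Hypothetical registry: the scheme is chosen by a config value, not by new code.
TOKENIZERS: dict[str, Callable[[list[Note]], list[str]]] = {
    "remi_like": remi_like,
    "event_stream": event_stream,
}


def tokenize(notes: list[Note], config: dict) -> list[str]:
    return TOKENIZERS[config["tokenization"]](notes)


if __name__ == "__main__":
    melody = [Note(0, 4, 60, 100), Note(20, 2, 64, 90)]
    for scheme in TOKENIZERS:
        print(scheme, tokenize(melody, {"tokenization": scheme}))
```

The calling code stays the same for both schemes; only the `"tokenization"` value changes. That is the property the review credits to MidiTok's uniform interface, shown here at toy scale.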

Key highlights

  • Implements ten established tokenization methods (REMI, MuMIDI, PerTok, etc.) with a uniform interface
  • Trains subword vocabularies (BPE, Unigram, WordPiece) using Hugging Face tokenizers for fast encoding
  • Integrates with the Hugging Face Hub for sharing trained tokenizer configurations
  • Ships PyTorch dataset utilities and data collators to reduce boilerplate
  • Reads and writes MIDI and ABC files via Symusic
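
For readers unfamiliar with the subword training mentioned in the highlights, the following toy sketch shows the core idea of byte-pair encoding applied to music tokens: repeatedly merge the most frequent adjacent pair into a new token. It is a plain-Python illustration under simplifying assumptions, not the Hugging Face tokenizers implementation and not MidiTok's code; the function names and the `+` joining convention are hypothetical.

```python
from collections import Counter
from typing import Optional


def most_frequent_pair(sequences: list[list[str]]) -> Optional[tuple[str, str]]:
    """Return the most common adjacent token pair across all sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq: list[str], pair: tuple[str, str], new_token: str) -> list[str]:
    """Replace each non-overlapping occurrence of `pair` with `new_token`."""
    merged: list[str] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(sequences: list[list[str]], num_merges: int):
    """Learn up to `num_merges` merges; return the merges and the re-encoded corpus."""
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:  # nothing left to merge
            break
        new_token = pair[0] + "+" + pair[1]
        sequences = [merge_pair(seq, pair, new_token) for seq in sequences]
        merges.append(pair)
    return merges, sequences


if __name__ == "__main__":
    corpus = [[
        "Pitch_60", "Velocity_100", "Duration_4",
        "Pitch_60", "Velocity_100", "Duration_2",
    ]]
    learned, encoded = train_bpe(corpus, num_merges=1)
    print(learned)
    print(encoded)
```

Each merge shortens the sequences at the cost of a larger vocabulary, which is the trade-off that makes subword training attractive for long symbolic-music sequences.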

Caveats

  • Only MIDI and ABC files are supported; MusicXML input and handling of Control Change messages are still on the todo list
  • The authors note that global and track event parsing could use a Rust or C++ speed-up

Verdict

Worth a look if you’re building generative music models and don’t want to maintain a private fork of every ISMIR tokenization paper. If you’re just transcribing the occasional MIDI, it’s overkill.
