The Boring Plumbing Behind Music-Generating Transformers
MidiTok exists because no one should have to reimplement ten different academic tokenization papers just to feed a MIDI file into a transformer.

What it does
MidiTok converts MIDI and ABC files into token sequences for deep learning models. It wraps ten published tokenization schemes (REMI, CPWord, Octuple, and others) behind one consistent API with shared configuration. The library also handles the tedious adjacent tasks: training BPE, Unigram, and WordPiece vocabularies via Hugging Face tokenizers, data augmentation, and ready-made PyTorch dataset helpers.
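To make "token sequences" concrete, here is a toy REMI-style encoder in plain Python. It is not MidiTok's implementation: the `Note` class, the `toy_remi_tokens` function, the token names, and the 16-ticks-per-bar grid are all assumptions made for illustration, and real schemes add more token types such as tempo and chords.

```python
from dataclasses import dataclass


@dataclass
class Note:
    start: int      # onset in ticks from the start of the piece
    duration: int   # length in ticks
    pitch: int      # MIDI pitch number, 0-127
    velocity: int   # MIDI velocity, 0-127


def toy_remi_tokens(notes, ticks_per_bar=16):
    """Flatten notes into a simplified REMI-like token list (illustrative only)."""
    tokens = []
    current_bar = -1
    for note in sorted(notes, key=lambda n: (n.start, n.pitch)):
        bar, position = divmod(note.start, ticks_per_bar)
        # Emit one Bar token per bar boundary crossed, including empty bars.
        while current_bar < bar:
            current_bar += 1
            tokens.append("Bar")
        # Each note becomes a position within the bar plus its three attributes.
        tokens += [
            f"Position_{position}",
            f"Pitch_{note.pitch}",
            f"Velocity_{note.velocity}",
            f"Duration_{note.duration}",
        ]
    return tokens


notes = [Note(0, 4, 60, 100), Note(4, 4, 64, 100), Note(16, 8, 67, 90)]
print(toy_remi_tokens(notes))
```

The three notes come out as two bars of `Position`, `Pitch`, `Velocity`, `Duration` groups. A model then sees these strings as integer ids from a vocabulary.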
The interesting bit
The real value isn’t any single tokenization method; it’s the abstraction layer that stops you from rewriting the same MIDI preprocessing boilerplate every time a new paper introduces a new vocabulary. Because every tokenizer sits behind the same interface and shared configuration, switching schemes becomes a config change rather than a rewrite.
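A minimal sketch of what that config change could look like, assuming a MidiTok 3.x install where tokenizer classes such as `REMI` and `Octuple` are exposed at the package top level and take a `TokenizerConfig`. The helpers `build_tokenizer` and `tokenize_file` are hypothetical wrappers written for this example, not part of the library.

```python
from pathlib import Path


def build_tokenizer(scheme="REMI", **config_kwargs):
    """Instantiate a MidiTok tokenizer class by name.

    Hypothetical helper, not part of MidiTok. Assumes MidiTok 3.x.
    """
    import miditok  # imported lazily so this sketch loads without the package

    config = miditok.TokenizerConfig(**config_kwargs)
    tokenizer_cls = getattr(miditok, scheme)  # e.g. miditok.REMI, miditok.Octuple
    return tokenizer_cls(config)


def tokenize_file(midi_path, scheme="REMI", **config_kwargs):
    """Encode one MIDI file with the chosen scheme (hypothetical helper)."""
    tokenizer = build_tokenizer(scheme, **config_kwargs)
    return tokenizer.encode(Path(midi_path))
```

With these helpers, going from `tokenize_file("song.mid", "REMI", num_velocities=16)` to `tokenize_file("song.mid", "Octuple", num_velocities=16)` changes one string and leaves the rest of the pipeline alone. The file name `song.mid` is a placeholder.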
Key highlights
- Implements ten established tokenization methods (REMI, MuMIDI, PerTok, etc.) with a uniform interface
- Trains subword vocabularies (BPE, Unigram, WordPiece) using Hugging Face tokenizers for fast encoding
- Integrates with the Hugging Face Hub for sharing trained tokenizer configurations
- Ships PyTorch dataset utilities and data collators to reduce boilerplate
- Reads and writes MIDI and ABC files via Symusic
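For readers unfamiliar with subword training on music tokens, the idea behind BPE can be sketched in a few lines. This is a toy, not how MidiTok does it (the library delegates training to Hugging Face tokenizers); the function names and the `+` join marker are assumptions made for illustration.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most common adjacent token pair across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "+" + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def toy_bpe(sequences, num_merges):
    """Apply up to `num_merges` greedy merges; return new sequences and the merges."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return sequences, merges
```

Each merge adds one entry to the vocabulary and shortens every sequence that contains the pair, which is why a trained subword vocabulary yields shorter inputs for the same music.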
Caveats
- Only MIDI and ABC files are supported; MusicXML support and Control Change message handling are still on the to-do list
- The authors note that global and track event parsing could use a Rust or C++ speed-up
Verdict
Worth a look if you’re building generative music models and don’t want to maintain a private fork of every ISMIR tokenization paper. If you’re just converting the occasional MIDI file, it’s overkill.