Is MidiTok open source?

Yes — Natooz/MidiTok is open source, released under the MIT license.

What language is MidiTok written in?

Natooz/MidiTok is primarily written in Python.

How popular is MidiTok?

Natooz/MidiTok has 884 stars on GitHub.

Where can I find MidiTok?

Natooz/MidiTok is on GitHub at https://github.com/Natooz/MidiTok.


Natooz/MidiTok

The Boring Plumbing Behind Music-Generating Transformers

MidiTok exists because no one should have to reimplement ten different academic tokenization papers just to feed a MIDI file into a transformer.

★ 884 stars · Python · Data Tooling · Image · Video · Audio


What it does

MidiTok converts MIDI and ABC files into token sequences for deep learning models. It wraps ten published tokenization schemes (REMI, CPWord, Octuple, and others) behind one consistent API with shared configuration. The library also handles the tedious adjacent tasks: BPE, Unigram, and WordPiece vocabulary training via Hugging Face tokenizers, data augmentation, and ready-made PyTorch dataset helpers.

The interesting bit

The real value isn’t any single tokenization method; it’s the abstraction layer that stops you from rewriting the same MIDI preprocessing boilerplate every time a new paper drops a new vocabulary. By treating tokenizers as interchangeable modules, MidiTok effectively turns music-format wrangling into a config change.
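
To make the "config change" idea concrete, here is a minimal, library-free sketch. It does not use MidiTok's API or reproduce its implementations: the Note class, the two toy schemes (remi_like and midi_like), the TOKENIZERS registry, and the 16-ticks-per-bar grid are all hypothetical names and assumptions chosen for illustration. It only shows how two tokenization schemes can sit behind one call signature and be swapped by a configuration value.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Note:
    """Hypothetical note record; not a MidiTok or Symusic type."""
    start: int      # onset in ticks
    duration: int   # length in ticks
    pitch: int      # MIDI pitch, 0-127
    velocity: int   # MIDI velocity, 0-127


TICKS_PER_BAR = 16  # assumption: 4/4 time at 4 ticks per beat


def remi_like(notes: List[Note]) -> List[str]:
    """Bar/Position/Pitch/Velocity/Duration tokens, loosely in the spirit of REMI."""
    tokens: List[str] = []
    current_bar = -1
    for note in sorted(notes, key=lambda n: (n.start, n.pitch)):
        bar, position = divmod(note.start, TICKS_PER_BAR)
        while current_bar < bar:  # emit one Bar token per bar, including empty ones
            tokens.append("Bar")
            current_bar += 1
        tokens += [
            f"Position_{position}",
            f"Pitch_{note.pitch}",
            f"Velocity_{note.velocity}",
            f"Duration_{note.duration}",
        ]
    return tokens


def midi_like(notes: List[Note]) -> List[str]:
    """NoteOn/NoteOff/TimeShift tokens, loosely in the spirit of MIDI-Like encodings."""
    events = []
    for note in notes:
        # kind 1 = note on, kind 0 = note off, so offs sort before ons at equal times
        events.append((note.start, 1, note.pitch,
                       (f"NoteOn_{note.pitch}", f"Velocity_{note.velocity}")))
        events.append((note.start + note.duration, 0, note.pitch,
                       (f"NoteOff_{note.pitch}",)))
    events.sort()
    tokens: List[str] = []
    now = 0
    for time, _kind, _pitch, names in events:
        if time > now:
            tokens.append(f"TimeShift_{time - now}")
            now = time
        tokens.extend(names)
    return tokens


# Hypothetical registry: the scheme is picked by a config value, not by code changes.
TOKENIZERS: Dict[str, Callable[[List[Note]], List[str]]] = {
    "remi_like": remi_like,
    "midi_like": midi_like,
}


def tokenize(notes: List[Note], config: Dict[str, str]) -> List[str]:
    return TOKENIZERS[config["tokenization"]](notes)


if __name__ == "__main__":
    notes = [Note(0, 4, 60, 100), Note(20, 2, 64, 80)]
    for name in TOKENIZERS:
        print(name, tokenize(notes, {"tokenization": name}))
```

Switching schemes here means changing one string in the config dictionary; the calling code stays the same. The real library's classes and configuration options are richer than this toy and should be taken from its documentation.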

Key highlights

Implements ten established tokenization methods (REMI, MuMIDI, PerTok, etc.) with a uniform interface
Trains subword vocabularies (BPE, Unigram, WordPiece) using Hugging Face tokenizers for fast encoding
Integrates with the Hugging Face Hub for sharing trained tokenizer configurations
Ships PyTorch dataset utilities and data collators to reduce boilerplate
Reads and writes MIDI and ABC files via Symusic
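
For readers unfamiliar with what training a subword vocabulary means for music tokens, the following is a toy byte-pair-encoding loop in plain Python. It is only a sketch of the general BPE idea (repeatedly merge the most frequent adjacent pair of symbols); it is not MidiTok's code, which per the summary above delegates training to Hugging Face tokenizers. The function names, the "+" joining convention, and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter
from typing import List, Optional, Tuple

Pair = Tuple[str, str]


def most_frequent_pair(sequences: List[List[str]]) -> Optional[Pair]:
    """Return the most common adjacent token pair, or None if no pairs exist."""
    counts: Counter = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    # Ties are broken by the pair itself so the result is deterministic.
    return max(counts, key=lambda pair: (counts[pair], pair))


def merge_pair(seq: List[str], pair: Pair) -> List[str]:
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged: List[str] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "+" + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(sequences: List[List[str]], num_merges: int):
    """Learn up to `num_merges` merges; return the merges and the re-encoded corpus."""
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences


if __name__ == "__main__":
    corpus = [[
        "Pitch_60", "Velocity_100", "Duration_4",
        "Pitch_60", "Velocity_100", "Duration_2",
    ]]
    merges, encoded = train_bpe(corpus, num_merges=1)
    print(merges)
    print(encoded)
```

In this tiny corpus the pair (Pitch_60, Velocity_100) occurs twice, so it is merged first and the six-token sequence shrinks to four symbols. Shorter sequences for the same music are the practical reason to learn such merges before feeding tokens to a transformer.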

Caveats

Only MIDI and ABC files are supported; MusicXML support and handling of Control Change messages are on the to-do list
The authors note that global and track event parsing could use a Rust or C++ speed-up

Verdict

Worth a look if you’re building generative music models and don’t want to maintain a private fork of every ISMIR tokenization paper. If you’re just transcribing the occasional MIDI, it’s overkill.