Teaching Transformers to Feel the Beat
A MIDI token format that gives language models a sense of musical time, so they can generate structured pop piano instead of aimless note soup.

What it does
REMI (REvamped MIDI-derived events) is a token representation that turns MIDI scores into discrete sequences with explicit metrical structure—beats, bars, tempo changes, and chord labels included. The authors train a Transformer-XL on this format to generate minute-long pop piano pieces with coherent rhythm and harmony, no post-processing required. You can prompt it with a MIDI file for continuation, or generate from scratch with control over local tempo and chord progression.
The interesting bit
The cleverness is in the encoding, not the architecture. Standard MIDI-like tokenizations dump note events as a flat stream; REMI interleaves beat and bar markers so the model learns when things happen, not just what. It’s a data-format hack that solves a musical structure problem without touching the transformer itself.
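To make the contrast concrete, here is a minimal sketch of a REMI-style encoder. It is an illustration, not the repository's code: the function `to_remi_like`, the token spellings, the 480 ticks-per-beat resolution, the fixed 4/4 meter, and the 16-step grid are all assumptions chosen for the example, and tempo and chord tokens are left out for brevity.

```python
from typing import List, Tuple

# Hypothetical simplified note: (start_tick, pitch, duration_ticks, velocity)
Note = Tuple[int, int, int, int]

TICKS_PER_BEAT = 480    # assumed MIDI resolution
BEATS_PER_BAR = 4       # assumes 4/4 throughout
POSITIONS_PER_BAR = 16  # assumed sixteenth-note grid


def to_remi_like(notes: List[Note]) -> List[str]:
    """Turn notes into a token stream with explicit Bar and Position markers."""
    ticks_per_bar = TICKS_PER_BEAT * BEATS_PER_BAR
    ticks_per_pos = ticks_per_bar // POSITIONS_PER_BAR
    tokens: List[str] = []
    current_bar = -1
    for start, pitch, duration, velocity in sorted(notes):
        bar = start // ticks_per_bar
        # Emit one Bar token per bar line crossed, including empty bars,
        # so the model sees the metrical grid even when nothing plays.
        while current_bar < bar:
            current_bar += 1
            tokens.append("Bar")
        # Position within the bar, 1-based on the sixteenth-note grid.
        position = (start % ticks_per_bar) // ticks_per_pos + 1
        tokens.append(f"Position_{position}/{POSITIONS_PER_BAR}")
        tokens.append(f"Velocity_{velocity}")
        tokens.append(f"Pitch_{pitch}")
        # Duration in grid steps, never shorter than one step.
        tokens.append(f"Duration_{max(1, duration // ticks_per_pos)}")
    return tokens


if __name__ == "__main__":
    # A quarter note on the downbeat, then a short note early in bar two.
    print(to_remi_like([(0, 60, 480, 64), (2160, 64, 120, 80)]))
```

A flat MIDI-like stream would encode the gap between those two notes as a time-shift value; here the second note is located by a fresh `Bar` token plus a position in the bar, which is the "when" the model gets to learn.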
Key highlights
- Two pre-trained checkpoints available (~430 MB each): one with tempo control, one with tempo + chord conditioning
- 775 training MIDI files and 100 evaluation prompts provided for continuation experiments
- Interactive web demo exists (built by a contributor, not the authors)
- Sampling parameters (temperature, topk) are exposed and acknowledged as critical to output quality
- midi2remi.ipynb shows the conversion pipeline
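For readers unfamiliar with the two sampling knobs mentioned above, the sketch below shows the generic technique of temperature-scaled top-k sampling over a vector of logits. The function `sample_top_k` is hypothetical and written with the standard library only; the repository's own sampling routine may differ in its details.

```python
import math
import random
from typing import List, Optional


def sample_top_k(
    logits: List[float],
    temperature: float = 1.0,
    topk: int = 5,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one token index from the topk highest logits."""
    if not logits:
        raise ValueError("logits must be non-empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    k = max(1, min(topk, len(logits)))
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    fake_logits = [0.1, 2.5, 0.3, 1.9, -1.0]  # made-up values for illustration
    print(sample_top_k(fake_logits, temperature=1.2, topk=3, rng=random.Random(0)))
```

With `topk=1` this reduces to greedy decoding, which tends to loop; a larger `topk` or a higher `temperature` adds variety at the cost of more wrong-sounding notes, which is why the choice matters so much for output quality.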
Caveats
- Locked to TensorFlow 1.14.0, which is well past end-of-life; getting this running on modern CUDA is archaeology
- Audio synthesis is explicitly punted to external DAWs or FluidSynth, with a known bug around tempo changes
- Fine-tuning on personal data is possible but undocumented beyond a GitHub issue thread
Verdict
Worth a look if you’re researching symbolic music generation or need a pop piano baseline with structural control. Skip it if you want a maintained, modern framework: this is a 2020 research artifact with 2020 dependencies.