Tuning the Transformer: Context-Aware Masking for Controlled Music Generation in MIDI
Audio, images, and text already have well-established data processing pipelines proven to yield amazing results with large deep-learning models. However, applying these methods to music, especially in MIDI format, presents unique challenges. In this talk, we explore the application of context-aware masking techniques to data obtained by recording piano performances in MIDI format.
We demonstrate how methods inspired by masked language modeling, image inpainting, and next-token prediction can be adapted to preprocess MIDI data, capturing the harmonic, dynamic, and temporal information essential for music. These preprocessing strategies yield context-aware infilling tasks that can be used to train large transformer models to generate more emotionally nuanced musical performances.
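As a rough illustration of the kind of preprocessing described above, the sketch below builds a span-infilling example from a list of piano notes: a contiguous stretch of the performance is hidden behind a sentinel, and the hidden notes become the generation target. The `Note` fields, the `mask_fraction` parameter, and the `<MASK>` sentinel are illustrative assumptions, not the talk's actual data schema.

```python
import random
from dataclasses import dataclass


@dataclass
class Note:
    pitch: int     # MIDI pitch number, 0-127
    velocity: int  # key-press velocity, 0-127 (carries the dynamics)
    start: float   # onset time in seconds
    end: float     # release time in seconds


def make_infilling_example(notes, mask_fraction=0.15, rng=random):
    """Split a note sequence into (context, target) for span infilling.

    A contiguous span of notes is removed from the performance and
    replaced by a single sentinel; the model is trained to reconstruct
    the removed span from the surrounding musical context.
    """
    notes = sorted(notes, key=lambda n: n.start)
    span_len = max(1, int(len(notes) * mask_fraction))
    start_idx = rng.randrange(0, len(notes) - span_len + 1)

    target = notes[start_idx:start_idx + span_len]  # notes the model must generate
    context = (
        notes[:start_idx]
        + ["<MASK>"]                                # sentinel marking the gap
        + notes[start_idx + span_len:]
    )
    return context, target


# Toy usage: a short ascending phrase with a gentle crescendo
phrase = [
    Note(pitch=60 + i, velocity=64 + 4 * i, start=0.5 * i, end=0.5 * i + 0.4)
    for i in range(8)
]
ctx, tgt = make_infilling_example(phrase, mask_fraction=0.25)
print(len(ctx), "context items;", len(tgt), "notes to infill")
```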