PyData NYC 2024

Tuning the Transformer: Context-Aware Masking for Controlled Music Generation in MIDI
11-07, 16:05–16:45 (US/Eastern), Music Box

Audio, images, and text already have well-established data processing pipelines proven to yield amazing results with large deep-learning models. However, applying these methods to music, especially in MIDI format, presents unique challenges. In this talk, we explore the application of context-aware masking techniques to data obtained by recording piano performances in MIDI format.

We demonstrate how methods inspired by masked language modeling, image inpainting, and next-token prediction can be adapted to preprocess MIDI data, capturing the harmonic, dynamic, and temporal information essential for music. These preprocessing strategies can lead to the creation of context-aware infilling tasks, which allow for the training of large transformer models that generate more emotionally nuanced musical performances.
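
As a rough illustration of such an infilling task, the sketch below masks a contiguous span of key-press events so a model can be trained to reconstruct it from the surrounding context. The event fields, the mask_note_events helper, and the masking rule are simplified assumptions for this listing, not the exact scheme presented in the talk.

    # Minimal sketch of a context-aware infilling task over MIDI-like note events.
    # Event layout and masking rule are illustrative assumptions.
    import random

    def mask_note_events(events, mask_fraction=0.15, mask_token="<mask>"):
        """Replace a contiguous span of note events with mask tokens,
        keeping the surrounding context for the model to condition on."""
        n = len(events)
        span = max(1, int(n * mask_fraction))
        start = random.randint(0, n - span)
        target = events[start:start + span]          # what the model must infill
        masked = events[:start] + [mask_token] * span + events[start + span:]
        return masked, target

    # Each event is (pitch, velocity, start_time, duration) for one key press.
    performance = [(60, 80, 0.0, 0.5), (64, 72, 0.5, 0.5), (67, 90, 1.0, 1.0)]
    masked_input, infill_target = mask_note_events(performance, mask_fraction=0.3)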


This talk explores how large neural networks can be used to make machine-generated piano performances more musically expressive. The objective is to demonstrate how context-aware masking techniques, inspired by advances in NLP, can be adapted to capture the subtle harmonic, dynamic, and temporal aspects of music when applied to piano performances recorded in an event-based format like MIDI.

The presentation will outline the design and use of the Performance Inference And Note Orchestration (PIANO) dataset, specifically focusing on partial voice and dynamic reconstruction tasks, using the rich representation of the piano keyboard interface offered by MIDI recordings. We will explore the methodology for creating custom context-aware MIDI-based datasets by operating on table-like representations of piano performances as key-press events.
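
To give a concrete feel for such a table-like representation, here is a hedged sketch using Pandas with illustrative column names (pitch, velocity, start, duration) rather than the actual PIANO schema, together with a toy partial dynamic-reconstruction mask over a time window.

    # Illustrative sketch of a piano performance as a table of key-press events.
    # Column names and the masking window are assumptions, not the PIANO dataset's schema.
    import pandas as pd

    df = pd.DataFrame({
        "pitch":    [60, 64, 67, 72],       # MIDI note numbers of key presses
        "velocity": [80, 72, 90, 65],       # how hard each key was struck (dynamics)
        "start":    [0.0, 0.5, 1.0, 1.5],   # onset time in seconds
        "duration": [0.5, 0.5, 1.0, 0.75],  # how long the key was held
    })

    # Partial dynamic reconstruction: hide velocities inside a time window
    # and keep them aside as the prediction target.
    df["velocity"] = df["velocity"].astype("Int64")   # nullable ints so values can be masked
    window = df["start"].between(0.5, 1.0)
    target = df.loc[window, "velocity"].copy()
    df.loc[window, "velocity"] = pd.NA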

The discussion will cover pre-training and fine-tuning pipelines, providing a practical overview of how to train large language models with 100+ million parameters using musical input. Various methods of generating music will be examined in the fine-tuning step, along with the specific challenges of working with GPT models and tokenizing MIDI data. We’ll include practical tips on building dynamic data laboratories for training and validation cycles, with tools like HuggingFace, Pandas, PyTorch, and Streamlit. Audio examples will be shared throughout, illustrating both the successes and challenges encountered in the process.
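
For a flavor of the tokenization step, the sketch below shows one possible way to map key-press events onto a discrete vocabulary and instantiate a small GPT-style model with HuggingFace Transformers. The events_to_tokens helper, the bucket sizes, and the model configuration are assumptions made for illustration, not the pipeline shown in the talk.

    # Hedged sketch: turning key-press events into tokens for a GPT-style model.
    # Vocabulary design and quantization choices are illustrative assumptions.
    from transformers import GPT2Config, GPT2LMHeadModel

    def events_to_tokens(events):
        """Quantize each (pitch, velocity, duration) key press into discrete tokens."""
        tokens = []
        for pitch, velocity, duration in events:
            tokens.append(f"PITCH_{pitch}")
            tokens.append(f"VEL_{velocity // 16}")               # bucket velocity into 8 bins
            tokens.append(f"DUR_{min(int(duration * 8), 31)}")   # quantize duration into 32 bins
        return tokens

    vocab = [f"PITCH_{p}" for p in range(21, 109)] \
          + [f"VEL_{v}" for v in range(8)] \
          + [f"DUR_{d}" for d in range(32)]
    token_to_id = {tok: i for i, tok in enumerate(vocab)}

    config = GPT2Config(vocab_size=len(vocab), n_layer=6, n_head=8, n_embd=512)
    model = GPT2LMHeadModel(config)  # roughly 20M parameters; scale up for the 100M+ setting

    ids = [token_to_id[t] for t in events_to_tokens([(60, 80, 0.5), (64, 72, 0.5)])]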

Attendees should have a basic understanding of machine learning, particularly LLMs, and some familiarity with music theory and MIDI data. This talk will be particularly valuable for those interested in the intersection of AI and music, offering a complete journey from data preparation to generative modeling. Those working on generative models in domains like NLP and computer vision will gain insights into how similar techniques can be adapted for musical data.


Prior Knowledge Expected: Previous knowledge expected

I have been in love with mathematics, physics, and music since childhood. I started programming at the age of 15 and have been fascinated by data science ever since. I'm also a guitar player and a performing chorister, now exploring the possible connections between music and data science.

I've had the opportunity to work on a wide range of tasks: from collaborating on software projects for industrial plants and performing time-series modeling, through training transformer models with custom tokenizers, to implementing machine-learning methods for security automation as a vendor at Google.

I am currently a computer science student at the Faculty of Mathematics and Information Science at Warsaw University of Technology. Since August 2023, I have been working as a Machine Learning Engineer at EPR Labs, where I combine my passions for music, mathematics, and data science. I develop software for training and evaluating large language models on musical data, among other fascinating projects.

Data science connoisseur with an obsession for converting numbers into sounds.