11-06, 10:50–12:20 (US/Eastern), Music Box
Libraries like Ibis have been gaining traction recently by unifying the way we work with data across multiple data platforms—from dataframe APIs to databases, from dev to prod. What if we could extend the abstraction to machine learning workflows (broadly, sequences of steps that implement `fit` and `transform` methods)? In this tutorial, we will develop an end-to-end machine learning project to predict the live win probability at any given move during a chess game.
For attendees: Follow the instructions on https://github.com/deepyaman/lichess-live-win-probability-tutorial/blob/main/00%20-%20Welcome.ipynb to create a GitHub Codespace for this tutorial.
As Python has become the lingua franca of data science, pandas and scikit-learn have cemented their roles in the standard machine learning toolkit. However, when data volumes rise, this stack becomes unwieldy (requiring proportionately larger compute, subsampling to reduce data size, or both) or altogether untenable.
Luckily, modern analytical databases (like DuckDB) and dataframe libraries (such as Polars) can crunch this same tabular data, but perform orders-of-magnitude faster than pandas, all while using less memory. Ibis already provides a unified dataframe API that lets users leverage a plethora of popular databases and analytics tools (BigQuery, Snowflake, Spark, DuckDB, etc.) without rewriting their data engineering code. However, at scale, the performance bottleneck is pushed to the ML pipeline.
IbisML extends the intrinsic benefits of using Ibis to the ML workflow. It lets you bring your ML to the database (or other Ibis-supported backend), and supports efficient integration with modeling frameworks like XGBoost, PyTorch, and scikit-learn. On top of that, IbisML steps can be used as estimators within the familiar context of scikit-learn pipelines.
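For reference, this is the scikit-learn pipeline pattern that IbisML steps slot into. The sketch below uses a plain `StandardScaler` in the preprocessing slot and toy, hypothetical features (Elo difference and move number); in the tutorial, that slot would instead hold an IbisML `Recipe`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in features for a few positions: [elo_diff, move_number].
X = [[50, 10], [-100, 25], [100, 40], [-30, 5]]
y = [1, 0, 1, 0]  # 1 = white went on to win

# In the tutorial, the "prep" stage would be an IbisML Recipe instead of
# a StandardScaler; the rest of the pipeline stays the same.
pipe = Pipeline([("prep", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X, y)

# Predicted live win probabilities for each position.
probs = pipe.predict_proba(X)[:, 1]
```

Because IbisML steps follow the same estimator interface, swapping them in requires no changes to the surrounding pipeline code.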
In this tutorial, we'll cover:
- Ibis basics, and how we can apply them to explore the Lichess chess games database and create meaningful features.
- IbisML constructs, including `Step`s and `Recipe`s, and how we can combine them to process features before passing them to our live win probability model.
- Data handoff for model training and inference, completing our end-to-end ML workflow.
This is a hands-on tutorial, and you will train a simple (not great!) live win probability model on a provided dataset. You'll also see how the result can be run at scale on a distributed backend. Participants should ideally have some experience using Python dataframe libraries; scikit-learn or other modeling framework familiarity is helpful but not required.
Previous knowledge expected
Deepyaman is a maintainer of Kedro, an open-source Python framework for building production-ready data science pipelines. He is passionate about building and contributing to the broader open-source data ecosystem.
Previously, Deepyaman was a software engineer at Voltron Data. Before Claypot AI's acquisition by Voltron Data, he was a Founding Machine Learning Engineer there, working on its real-time feature engineering platform. Prior to that, he led data engineering teams and asset development across a range of industries at QuantumBlack, AI by McKinsey.
Anjali is a postdoc at Stanford Medicine working on MRI processing to identify neurosurgical targets. She also has a PhD in Electrical Engineering from Stanford, during which she developed MRI acquisition and reconstruction methods. Medical imaging is, of course, a field where ML is taking over, and Anjali is now interested in applications of deep learning to MRI and other signal processing problems.