PyData NYC 2024

Data Secrets from a Platform Engineer
11-07, 11:40–12:20 (US/Eastern), Central Park West

Beneath the surface of sophisticated services lies a vast realm hidden from most data scientists… the Platform. Present in every pipeline, every query, every notebook, it shapes the availability, consistency, and resilience of the systems on which we depend. In this talk, discover how understanding more about “the Platform” can help data scientists make choices that improve their chances of getting to production. The path to unlocking your data superpowers awaits. Will you take it?


Though it governs the very fabric of our data-driven universe, most data scientists are blissfully unaware of “the Platform”. Much like “the Matrix” from the 1999 science fiction action film by the Wachowski sisters, a data platform is designed to be invisible to the naked eye. Most data scientists operate at a higher level of abstraction. We’re encouraged to build on top of the Platform, going deep on the algorithms and science, but staying light on the engineering. What if we have that backwards?

In this talk, we’ll learn how a modest understanding of platform engineering can make you a much better data scientist. For instance, learning about consistency will give you more intuition around automated model retraining, and how to retrieve snapshots from prior training cycles. Learning about the benefits of columnar data storage will make you think twice about saving data as CSV. Understanding fault tolerance can help you build recovery strategies, such as checkpointing and error handling, into your machine learning pipelines. And perhaps most importantly, learning about the platform enables you to build models that integrate seamlessly with existing systems and services, so you’ll be more likely to get your models to production. And trust me, you want to be known as a data scientist who knows how to get models to production.
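
As a rough illustration of the checkpointing idea mentioned above (not taken from the talk itself), here is a minimal sketch using only the Python standard library. The function names `save_checkpoint`, `load_checkpoint`, and `run_pipeline`, the JSON checkpoint format, and the "sum each batch" workload are all hypothetical choices made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: write to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the file in one step, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"next_batch": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run_pipeline(batches, path, process):
    """Process batches in order, resuming from the last completed batch."""
    state = load_checkpoint(path)
    for i in range(state["next_batch"], len(batches)):
        # If process() raises, the checkpoint on disk still points at batch i.
        state["total"] += process(batches[i])
        state["next_batch"] = i + 1
        save_checkpoint(path, state)
    return state["total"]
```

If `process` fails partway through, calling `run_pipeline` again with the same checkpoint path picks up at the failed batch instead of starting over. A real pipeline would likely checkpoint model weights or offsets rather than a running total, but the recovery pattern is the same.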

This talk is your portal to the reality beneath the surface of data science. You will learn to perceive and interpret the underlying principles (availability, consistency, concurrency, and failure) that influence your data’s journey, and see how mastering these elements can elevate your analytics. The choice is yours: continue operating within the confines of the known, or delve deeper into the world of platform engineering to truly transform your data science practice.

Come learn the secrets of “the Platform” and unlock your inner data science Neo.

Outline

  1. Introduction: Why go beneath the surface of data science? (4 min)
  2. Always Bet on Parquet (5 min)
  3. Old Data or Slow Data: Pick Your Poison (7 min)
  4. There is No Batch, or why all data is actually streaming data (5 min)
  5. It's OK to Fail (4 min)
  6. Now what? (5 min)

Prior Knowledge Expected

No previous knowledge expected

Dr. Rebecca Bilbro is an applied AI/ML engineer and one of the pioneers of the data science revolution of the early 2010s. Co-author of Applied Text Analysis with Python (O'Reilly 2017) and Apache Hudi: The Definitive Guide (O'Reilly 2024), Rebecca has worked across academia, industry, and the public sector. She is co-creator of Yellowbrick, a Python library that integrates the scikit-learn and matplotlib APIs to support more convenient model diagnostics and steering. As co-founder and CTO of Rotational Labs, Rebecca is motivated by a desire to unite the data science and engineering communities. She and her team help other companies leverage in-house domain expertise and data to build and deploy LLMs, data products, and services. Rebecca earned her doctorate from the University of Illinois, Urbana-Champaign, where her research centered on domain-specific languages within engineering.