PyData NYC 2024

Mastering DataFrame Diffing Techniques
11-08, 10:10–10:50 (US/Eastern), Central Park West

This talk explores techniques for efficiently comparing and diffing data frames, an essential task in data analysis and data engineering workflows. We'll cover everything from simple pandas assertions and SQL joins to more advanced tools like DuckDB and datacompy, then dive into time series data frame diffing using asof joins. We'll also discuss how to perform these operations at scale on platforms like Apache Spark, Snowflake, and BigQuery. Finally, we'll share insights from commercial tools like Datafold’s Data-Diff and show how these ideas can be implemented in both open-source and enterprise environments.


Data frame comparison is a fundamental operation in many data-centric projects, whether it's validating data quality, tracking changes over time, or reconciling data across systems. However, diffing large and complex data frames can be challenging, especially when dealing with time series data or working at scale.

In this talk, we'll start by exploring the basics of data frame diffing using pandas' built-in assertions and join operations. We'll then move on to more advanced techniques and libraries, such as DuckDB and datacompy, which offer efficient and feature-rich diffing capabilities.
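
As a rough illustration of the pandas basics mentioned above, the sketch below first checks two frames with an assertion and then classifies rows with an outer join. The sample tables, the "id" key and the "price" column are hypothetical and not taken from the talk; the plain != comparison also assumes the compared values contain no NaN.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical sample data: two versions of the same table keyed by "id".
old = pd.DataFrame({"id": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
new = pd.DataFrame({"id": [2, 3, 4], "price": [20.0, 35.0, 40.0]})

# 1) Assertion: assert_frame_equal raises AssertionError when frames differ.
try:
    assert_frame_equal(old, new)
    frames_equal = True
except AssertionError:
    frames_equal = False

# 2) Outer join with an indicator column to classify every key.
merged = old.merge(
    new, on="id", how="outer", suffixes=("_old", "_new"), indicator=True
)
removed = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
added = merged.loc[merged["_merge"] == "right_only", "id"].tolist()

# Keys present on both sides whose value changed (assumes no NaN values).
both = merged[merged["_merge"] == "both"]
changed = both.loc[both["price_old"] != both["price_new"], "id"].tolist()

print(frames_equal, removed, added, changed)
```

The assertion only answers whether the frames match, while the join reports which keys were removed, added, or changed, which is usually what a diff needs.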

One particular area of focus will be diffing time series data frames, a common scenario in finance, IoT, and other domains. We'll delve into the intricacies of asof joins and how they can be leveraged for accurate and performant time series diffing.
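
One possible shape for an asof-based time series diff is sketched below with pandas.merge_asof. The two feeds, the column names, the 200 ms tolerance and the choice of direction="nearest" are all assumptions made for illustration, not details from the talk.

```python
import pandas as pd

# Hypothetical price feeds whose timestamps do not line up exactly.
feed_a = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-01-01 09:30:00.000",
        "2024-01-01 09:30:01.000",
        "2024-01-01 09:30:05.000",
    ]),
    "price": [100.0, 101.0, 102.0],
})
feed_b = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-01-01 09:30:00.100",
        "2024-01-01 09:30:01.050",
        "2024-01-01 09:30:02.000",
    ]),
    "price": [100.0, 101.5, 103.0],
})

# Both inputs must be sorted by the "on" column. Each row of feed_a is paired
# with the nearest row of feed_b, but only if it lies within the tolerance.
aligned = pd.merge_asof(
    feed_a,
    feed_b,
    on="ts",
    direction="nearest",
    tolerance=pd.Timedelta("200ms"),
    suffixes=("_a", "_b"),
)

# Rows of feed_a with no counterpart inside the tolerance window.
unmatched = aligned[aligned["price_b"].isna()]

# Rows that were paired but disagree on value.
mismatched = aligned[(aligned["price_a"] - aligned["price_b"]).abs() > 1e-9]

print(aligned)
```

The tolerance determines what counts as the same observation: too tight and genuine matches are reported as missing, too loose and unrelated rows get paired.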

Finally, we'll discuss strategies for scaling data frame diffing operations on cloud data platforms like Snowflake and BigQuery, enabling efficient comparisons on large, distributed datasets.
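
A common way to keep large comparisons inside the database is to express the diff as SQL set operations, so that only the differing rows leave the engine. The sketch below uses Python's built-in sqlite3 module purely as a local stand-in for a warehouse; the table and column names are hypothetical, and set-operation syntax varies between dialects, so the queries may need adjusting for Snowflake or BigQuery.

```python
import sqlite3

# In-memory SQLite database standing in for a cloud warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_t (id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE new_t (id INTEGER PRIMARY KEY, price REAL);
    INSERT INTO old_t VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO new_t VALUES (2, 20.0), (3, 35.0), (4, 40.0);
""")

# Rows in old_t with no identical row in new_t (removed or changed).
only_in_old = conn.execute(
    "SELECT id, price FROM old_t "
    "EXCEPT SELECT id, price FROM new_t ORDER BY id"
).fetchall()

# Rows in new_t with no identical row in old_t (added or changed).
only_in_new = conn.execute(
    "SELECT id, price FROM new_t "
    "EXCEPT SELECT id, price FROM old_t ORDER BY id"
).fetchall()

conn.close()
print(only_in_old, only_in_new)
```

A key that appears in both result sets (id 3 here) was changed rather than added or removed.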

Throughout the talk, we'll emphasize practical examples and real-world use cases, ensuring that attendees can immediately apply the techniques learned to their own projects.

Prior Knowledge Expected:
Attendees should have a basic understanding of Python and the pandas library. Familiarity with SQL and cloud data platforms (e.g., Snowflake, BigQuery) would be beneficial but not strictly required.

Audience:
This talk is aimed at data engineers, data analysts, and software developers working with structured data and data frames in Python. It will be particularly relevant for those dealing with large datasets, time series data, or data validation and reconciliation tasks.

Takeaways:
After attending this talk, participants will:

Understand the different approaches to data frame diffing, from simple pandas operations to advanced libraries.
Learn techniques for accurate and efficient time series data frame diffing using asof joins.
Gain insights into scaling data frame diffing operations on cloud data platforms.
Be equipped with practical examples and use cases to apply these techniques in their own projects.

Slides: https://1drv.ms/p/s!Anz-_P4mNLkWgukdu8vnWjjvzgeeGA



I'm currently a software engineer on the data engineering team at Two Sigma.