PyData NYC 2024

Faster PySpark with Apache Arrow
11-07, 15:20–16:00 (US/Eastern), Music Box

PySpark is the Python API for Apache Spark, an open-source distributed computing framework that enables large-scale, real-time data processing. In this talk, we will show how integrating Apache Arrow, a high-performance columnar in-memory format, speeds up PySpark by making data exchange between the JVM and Python more efficient.


This session will cover the key ways Apache Arrow improves PySpark’s performance. We’ll start with an overview of Arrow’s in-memory format and its role in optimizing data transfers between Spark’s JVM and Python processes. Next, we’ll look at specific features like Pandas/Arrow UDFs. Finally, we’ll explore the role of Arrow in powering new features like Spark Connect, which allows remote execution of Spark workloads. Whether you’re a beginner or an experienced PySpark user, you’ll walk away with practical insights for speeding up your big data processing.


Prior Knowledge Expected

No previous knowledge expected

Allison Wang is a Software Engineer at Databricks and an Apache Spark Committer, specializing in Spark SQL and PySpark. She’s passionate about bridging Python with the big data ecosystem. Allison holds a bachelor’s degree in Computer Science from Carnegie Mellon University.