11-08, 14:35–15:15 (US/Eastern), Central Park West
What if I told you that you could complete a JSON parse-and-extract task on your laptop before a distributed compute cluster even finishes booting up? DuckDB is a lightweight, in-process analytical database that runs inside Python on your laptop and can wrangle large datasets efficiently, from both local and remote data sources. In this talk, we will show you how to query a dataset with DuckDB to extract, load and transform data right on your laptop. We'll then show you how to move your workloads to the Cloud, so you can run them at scale. Developing locally and pushing to the Cloud not only makes it easy to develop, debug and iterate, but also lets you quickly switch between workloads that do and don't require Cloud compute resources, cutting both cost and time.
DuckDB is an in-process OLAP database for fast analytical workloads that runs seamlessly alongside your Python data workloads. With native interoperability with DataFrame tools like PyArrow, Pandas and Polars, it integrates easily with your existing PyData stack. For many data wrangling tasks in the 10GB-1TB range, you typically won't need any distributed or "big data" tools - modern laptops are powerful enough to handle them! This doesn't just cut costs; it also makes your workloads much easier to debug, share and deploy, so instead of managing a distributed cluster you can focus your expertise on the important bits: interpreting, analyzing and visualizing the data.
In this talk, we'll take a typical data engineering workload that you'd run on a distributed compute engine like Spark and show you how to run that same workload with DuckDB in Python. We'll show you how to go from Python scripts to fast, performant SQL queries, and how to generate, test and debug those queries using Python tools right on your laptop. We'll also show you how to use the DuckDB relational API to run queries without writing SQL at all! Finally, we'll show you how to easily deploy your workload to the Cloud using MotherDuck, a serverless Cloud data warehouse that natively supports DuckDB, so your workload can scale to large datasets that don't fit on your laptop.
Talk overview:
This talk is aimed at data scientists and engineers who handle medium to large datasets (10GB-1TB), run data prep and ETL pipelines in distributed compute environments like Spark, and want to learn how a tool like DuckDB can speed up their workflows.
- Introduction: why DuckDB?
- How-to guide: Run a data engineering workload on your laptop with DuckDB and MotherDuck
(Image is from WikiHow - How to Breed Ducks, https://www.wikihow.com/Breed-Ducks)
No previous knowledge expected
Guen is a software engineer at MotherDuck on the Ecosystems team. Previously, she was a Sr. Quantum Measurement Engineer at Microsoft. She's spent her career in software engineering, data engineering and data science with Python in the context of scientific data acquisition, analysis and computation for experimental physics and biotech. She has given introductory talks and workshops on quantum computing with Python at various conferences, hackathons and events.
Developer Advocate at MotherDuck