11-08, 10:10–10:50 (US/Eastern), Central Park East
Tabular data is ubiquitous, and pandas has been the de facto tool in Python for
analyzing it. However, as data size scales, analysis using pandas may become
untenable. Luckily, modern analytical databases (like DuckDB) are able to
analyze this same tabular data, but perform orders-of-magnitude faster than
pandas, all while using less memory. Many of these systems only provide a SQL
interface though; something far different from pandas’ dataframe interface,
requiring a rewrite of your analysis code.
This talk will lay out the current database / data landscape as it relates to
the PyData stack, and explore how Ibis (an open-source, pure Python, dataframe
interface library) can help decouple interfaces from engines, to improve both performance
and portability. We'll examine other solutions for interacting with SQL from Python and
discuss some of their strengths and weaknesses.
Ibis is an open-source, pure Python library that lets you write Python to build up expressions
that can be executed on a wide array of backends / execution engines (SQLite,
DuckDB, Postgres, Spark, Clickhouse, Snowflake, BigQuery, and more!).
Modern analytical databases (like DuckDB) are able to analyze tabular data
orders-of-magnitude faster than pandas, all while using less memory.
pandas and other Python libraries can interact with databases, but they were
not designed to do so efficiently. Pulling ALL of the data to your local
machine to then perform a reduction or aggregation is only tractable for very
small problems.
Treating a remote database as a data store isn’t wrong, but it provides an
incomplete view of everything these systems can offer.
Because they are very, very fast. 50 years of database research hasn't gone to
waste - modern execution engines perform all kinds of optimizations to deliver
results quickly.
In a cruel twist of fate, though, almost all of them require you to write SQL in
order to use them. SQL is not an ideal tool for exploratory data analysis.
If you know exactly how to answer the question in front of you, then SQL is
probably (possibly) fine. But you don't always know how - that's part of what
the exploration is.
SQL is only a language – it’s an interface. The execution engine is a separate
thing. Historically the interface and the engine have been very tightly
coupled, but they don’t have to be.
Maybe you would like to use the DuckDB execution engine, but you don’t like the interface (SQL)?
Or you would like to use the Spark execution engine, but you don’t like the interface (PySpark API)?
The interface shouldn’t be a hurdle for a user to clear in order to make use of
the available tools. In the scientific Python community, SQL, in particular, is
a hurdle that many users have turned away from. Ibis provides a consistent,
Pythonic, and intuitive interface to interact with execution engines, even when
their only “advertised” interface is SQL.
No previous knowledge expected