Pushing Cython to its Limits in Scikit-learn, PyData NYC 2024

11-07, 15:20–16:00 (US/Eastern), Central Park West

scikit-learn is a machine-learning library for Python that uses NumPy and SciPy for numerical operations. It also ships its own compiled code for performance-critical computation, written in C, C++, and Cython. The library relies primarily on Cython for compiled code because Cython is easy to use and approachable. In this talk, we dive into the techniques scikit-learn employs to take full advantage of Cython. We will cover features such as using the C++ standard library within Cython, fused types, code generation with the Tempita engine, and OpenMP for parallelization.

scikit-learn is a machine-learning library largely backed by NumPy and SciPy. Although NumPy and SciPy provide many compiled operations, scikit-learn needs domain-specific operations for its performance-critical code. scikit-learn's compiled code is mainly written in Cython because of its ease of use and approachability. In scikit-learn, we push Cython to its limits by using many of its features to support machine learning use cases. In this talk, we cover many of these techniques, such as:

OpenMP to parallelize computation
Fused types to write a function once and support multiple data types
Using features from the C++ standard library such as vector, map, stack, or algorithm from Cython
Memoryviews to work with NumPy arrays through Python's buffer protocol
Tempita for code generation
C structs and Python classes from Cython
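
As a small illustration of the buffer protocol mentioned in the list above, here is a minimal sketch in plain Python rather than Cython. It uses the built-in memoryview and the standard-library array module as stand-ins; Cython's typed memoryviews are a separate, compiled feature that relies on the same protocol. The function name scale_in_place is hypothetical and is not taken from scikit-learn.

```python
from array import array


def scale_in_place(buf, factor):
    """Multiply every C double in a writable 1-D buffer by factor.

    Hypothetical helper for illustration only. It accepts any object that
    exports the buffer protocol and modifies the data without copying it.
    """
    view = memoryview(buf)  # ask the object for its underlying buffer
    if view.ndim != 1 or view.format != "d":
        raise TypeError("expected a 1-D buffer of C doubles")
    if view.readonly:
        raise ValueError("buffer is read-only")
    for i in range(view.shape[0]):
        view[i] = view[i] * factor  # writes go straight to the owner's memory


values = array("d", [1.0, 2.0, 3.0])
scale_in_place(values, 2.0)
print(values.tolist())
```

Because the view shares memory with its owner, the original array is updated in place. A contiguous one-dimensional float64 NumPy array should be accepted by the same function, since it exports the same buffer format; in Cython, a typed memoryview argument plays a similar role with compiled, type-checked element access.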

By the end of this talk, you will have a deeper understanding of Cython's capabilities that you can apply to your use cases. This intermediate talk is designed for machine learning engineers, data scientists, and software engineers.

Prior Knowledge Expected – Previous knowledge expected

Thomas J. Fan

Thomas J. Fan is a senior machine learning engineer at Union.ai and a maintainer of scikit-learn, an open-source machine learning library for Python. He led the development of scikit-learn's set_output API, which allows transformers to return pandas DataFrames. Previously, Thomas worked at Columbia University to improve interoperability between scikit-learn and AutoML systems. He also maintains skorch, a neural network library that wraps PyTorch.
