11-07, 14:35–15:15 (US/Eastern), Music Box
Let’s say you want to run a machine learning experiment: tagging a paragraph with appropriate labels. Given all the models and sample code out there, writing a notebook or Python script that does this is often relatively easy. Running it end to end on a single computer, however, can take a while and slow everything else down – these processes are compute-heavy.
Now, let’s say you like what your notebook is doing – you have made the best tagging script out there – and you want to enable all your users to tag their paragraphs using your magic. How would you deploy this to many users? What steps can you take so that your users aren’t waiting minutes for their tagged output?
Running machine learning inference on your local machine versus serving it to hundreds of users can look very different. In this talk, I will walk through a case study and share lessons learned from optimizing a tagging project for latency and memory.
The following is the outline of this talk.
- Introduction, Background, and Outlining the Problem (5 minutes, running total: 5 minutes):
  - The motivation and goal for the project. In this section I’ll make a case for why it’s essential to think about performance when going to production.
- Deployment (3 minutes, running total: 8 minutes):
  - A high-level look at how deployment was done (no DevOps details).
- What Seems to Be the Problem? (2 minutes, running total: 10 minutes):
  - The performance and memory issues observed during deployment.
- Drawing Board – Benchmarking and Methods to Find Potential Solutions (5 minutes, running total: 15 minutes):
  - The steps taken to understand the issues, identify the source of the problem, and brainstorm possible next steps and solutions.
- Solution Implementation and Testing: Latency Edition (5 minutes, running total: 20 minutes):
  - What I did to speed up the code: the libraries, the benchmarking, parallel programming, etc.
- Solution Implementation and Testing: Memory Edition (5 minutes, running total: 25 minutes):
  - What I did to solve the memory issues: the libraries, the benchmarking, pre-downloading models, etc.
- Wrap-up and Q&A (5 minutes, running total: 30 minutes):
  - Revisiting the observed issues: did the steps taken solve the problem, and to what extent? Plus audience questions.
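To give a flavor of the benchmarking step in the outline, here is a minimal sketch using Python’s standard-library `timeit`; `tag_paragraph` is a hypothetical stand-in, not the talk’s actual model:

```python
import timeit

def tag_paragraph(text: str) -> list[str]:
    # Hypothetical stand-in for a compute-heavy tagging model:
    # "tags" are just the long words in the paragraph.
    return [w.strip(".,").lower() for w in text.split() if len(w) > 6]

# Time many calls to get a stable per-call latency estimate.
total = timeit.timeit(
    lambda: tag_paragraph("Benchmarking inference latency matters."),
    number=1_000,
)
print(f"{total / 1_000 * 1e6:.1f} µs per call")
```

Measuring per-call latency like this, before and after each change, is what lets you tell whether an optimization actually helped.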
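For the latency section, one common pattern – shown here as a sketch, not the talk’s actual implementation – is fanning independent tagging requests out across a worker pool; `tag_paragraph` is again a hypothetical stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

def tag_paragraph(text: str) -> list[str]:
    # Hypothetical stand-in for a compute-heavy tagging model.
    known = ("energy", "data", "industrials")
    return [label for label in known if label in text.lower()]

def tag_many(paragraphs: list[str]) -> list[list[str]]:
    # Threads help when the model call releases the GIL or is I/O-bound;
    # for pure-Python CPU-bound work, swap in ProcessPoolExecutor.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(tag_paragraph, paragraphs))

print(tag_many(["Energy markets update", "Data pipelines at scale"]))
# → [['energy'], ['data']]
```

The right executor depends on where the time actually goes – which is exactly why the benchmarking step comes first.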
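For the memory section, the pre-downloading idea can be sketched as loading the model once per process instead of once per request; the model dict and `get_model` here are hypothetical placeholders:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    """Load the tagging model once per process.

    A real service would download weights at image-build or startup
    time rather than on the first request; this dict is a placeholder.
    """
    print("loading model...")  # happens only on the first call
    return {"labels": ["energy", "data", "industrials"]}

def handle_request(text: str) -> list[str]:
    model = get_model()  # cached after the first call
    return [label for label in model["labels"] if label in text.lower()]
```

Because `get_model` is cached, repeated requests reuse the same in-memory object rather than re-loading (and re-allocating) the model each time.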
No previous knowledge expected
Saba Nejad is a Data Engineer at Point72, working mostly with alternative data in the energy and industrials sector. She is broadly interested in using mathematics and programming to gain insight from real-world data. Before joining Point72, she studied at MIT, where she did research at the Institute for Data, Systems, and Society. She was previously a Product Manager at Quantopian.