PyData NYC 2024

Adopting Open-Source Tools for Time Series Forecasting: Opportunities and Pitfalls
11-07, 10:10–10:50 (US/Eastern), Music Box

Forecasting involves predicting future values of a time series from its historical values and is critical for informed decision-making in fields like finance, planning, and energy. The open-source community has developed several Python libraries that streamline the commonly recurring stages of a forecasting pipeline, minimizing redundancy and ensuring consistency across projects. sktime, skforecast, and Darts are among the most widely used of these libraries, yet data science teams often struggle to craft a systematic approach to evaluating such options.
In this talk, we will present a decision framework for data science leaders and teams to choose the appropriate tooling for their forecasting projects. Specifically, we will explore three critical dimensions that teams must consider:

Data Understanding: How well does the library support Exploratory Data Analysis (EDA)?

Data Preparation: How robust and intuitive are the tool's preprocessing capabilities for handling quality issues, like missing values, NaNs, duplicate data, and exogenous variables?

Modeling & Backtesting: How effective and scalable are the library’s modeling and evaluation capabilities for forecasting algorithms?

Each of these dimensions presents tradeoffs, and our decision framework is intended to help evaluate and navigate them. We will present a case study from energy management and use sktime and skforecast to guide the discussion. The work presented in this talk was conducted as part of my internship at Schneider Electric.

This talk is ideal for data scientists, machine learning engineers, and technical decision-makers who develop and maintain forecasting products and want to scale their efforts. Whether you are new to time series forecasting or an experienced practitioner looking to refine your toolset, this session will provide practical guidance on selecting the right open-source tools for your project.


Scaling the data science development process to multiple use cases and products entails adopting libraries that streamline and abstract away the commonly occurring tasks of training and inference pipelines. Given the critical role time series forecasting plays today, the open-source community has responded with several tools to help data science teams scale their efforts. This talk is designed to guide data scientists, machine learning engineers, and technical decision-makers through evaluating and choosing the most suitable open-source tools for time series forecasting.
We will present a three-dimensional decision framework for systematically evaluating time series forecasting libraries. We will then apply the framework to a case study involving an energy management system to compare sktime and skforecast, two of the most popular time series forecasting libraries. The dimensions of our framework are as follows.

  1. Data Understanding: How well a tool supports Exploratory Data Analysis (EDA). EDA is crucial for gaining initial insights into the dataset through descriptive analysis and visualization, which lay the groundwork for effective model development.

  2. Data Preparation: Proper data preparation is vital to the success of any forecasting model. We'll discuss how different tools handle common data challenges like missing values, NaNs, duplicate data, and exogenous variables. Efficient preprocessing capabilities streamline the data preparation process and ensure that the forecasting models have high-quality inputs.

  3. Modeling & Backtesting: Backtesting is the process of testing a forecasting model on historical data to evaluate its accuracy and reliability. In this dimension, we will examine how tools like sktime and skforecast implement backtesting methodologies such as sliding windows and expanding windows. Additionally, we'll look at how these tools support the modeling process itself, including model selection and validation.
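The descriptive side of EDA can be illustrated without any forecasting library at all. The sketch below, using an invented sequence of readings and only the standard library, computes summary statistics and lag autocorrelations, the kind of quick check that surfaces seasonality before any model is fit. In practice, libraries expose richer helpers for this (for example, sktime ships plotting utilities), which is exactly what the Data Understanding dimension evaluates.

```python
import statistics

# Hypothetical readings with a rough period of 6 (three low, three high values).
series = [20, 22, 21, 35, 37, 36, 20, 23, 22, 36, 38, 35]

def lag_autocorrelation(values, lag):
    """Pearson correlation between the series and a lagged copy of itself."""
    x, y = values[:-lag], values[lag:]
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

print(f"mean={statistics.fmean(series):.1f}, stdev={statistics.stdev(series):.1f}")
print(f"lag-3 autocorrelation={lag_autocorrelation(series, 3):.2f}")  # half-period: negative
print(f"lag-6 autocorrelation={lag_autocorrelation(series, 6):.2f}")  # full period: near 1
```

A strong positive autocorrelation at the candidate period (and a negative one at the half-period) is a cheap signal that a seasonal model, or seasonal features, will be worth trying.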

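To make the data-quality issues above concrete, here is a minimal, library-free sketch of two cleaning steps a preparation stage must handle: dropping duplicate timestamps and forward-filling missing values. The timestamps and readings are invented for illustration; pandas and the forecasting libraries under discussion provide vectorized equivalents of both operations.

```python
# Toy readings as (timestamp, value). A None marks a missing reading and the
# repeated "02:00" row is a duplicate -- two common quality issues in sensor data.
readings = [
    ("00:00", 20.0),
    ("01:00", None),     # missing value, to be forward-filled
    ("02:00", 21.5),
    ("02:00", 21.5),     # duplicate timestamp, to be dropped
    ("03:00", 22.0),
]

def clean(rows):
    """Drop duplicate timestamps (keeping the first), then forward-fill gaps."""
    seen, out, last = set(), [], None
    for ts, val in rows:
        if ts in seen:
            continue                 # skip duplicate timestamp
        seen.add(ts)
        if val is None:
            val = last               # forward fill from the previous observation
        out.append((ts, val))
        last = val
    return out

print(clean(readings))
# [('00:00', 20.0), ('01:00', 20.0), ('02:00', 21.5), ('03:00', 22.0)]
```

How much of this a library handles for you, and how transparently, is precisely what the Data Preparation dimension probes.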
Using sktime and skforecast as practical examples, this talk will not only introduce these tools but also guide attendees through the broader considerations of tool selection. The objective is to emphasize the importance of Data Understanding, Data Preparation, and Backtesting in building effective and reliable forecasting pipelines.
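The expanding- and sliding-window schemes mentioned above can be sketched in a few lines of plain Python, using a deliberately naive mean forecaster so the windowing logic stays visible. This is a library-agnostic illustration of the idea, not either library's API; sktime and skforecast expose these schemes through configurable splitters and backtesting utilities.

```python
import statistics

# Synthetic, gently trending series (invented for illustration).
series = [10, 12, 11, 13, 14, 13, 15, 16, 15, 17]

def backtest(series, initial_train, window=None):
    """One-step-ahead backtest with a mean forecaster.

    window=None -> expanding window (the training set grows each fold);
    window=k    -> sliding window over the last k observations.
    Returns the mean absolute error across folds.
    """
    errors = []
    for t in range(initial_train, len(series)):
        train = series[:t] if window is None else series[t - window:t]
        forecast = statistics.fmean(train)      # naive "historical mean" forecast
        errors.append(abs(series[t] - forecast))
    return statistics.fmean(errors)

print(f"expanding-window MAE: {backtest(series, initial_train=5):.2f}")
print(f"sliding-window  MAE: {backtest(series, initial_train=5, window=5):.2f}")
```

On this trending toy series, the sliding window yields the lower error because it discards stale history; on a stable series the tradeoff can reverse, which is why a library's support for both schemes matters.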

Outline:

• Introduction to time series forecasting and its significance.

• Overview of the open-source forecasting tools landscape.

• Detailed examination of the three key dimensions:

  1. Data Understanding: EDA capabilities and their role in forecasting.

  2. Data Preparation: Handling common data challenges effectively.

  3. Modeling & Backtesting: Ensuring model reliability through robust evaluation methods.

• Energy management case study using sktime and skforecast.

• Practical recommendations for selecting the right tools based on project requirements.

Takeaways: Attendees will gain a clear understanding of how to evaluate and select forecasting tools based on these three dimensions. They will also learn how sktime and skforecast manage these aspects of the forecasting pipeline, enabling them to make more informed decisions when developing or maintaining forecasting models.

Audience: This session is tailored for data scientists, machine learning engineers, and technical leads involved in time series forecasting. Some prior knowledge of time series concepts is helpful, though the talk will also provide valuable insights for those looking to deepen their understanding of forecasting tools.

Slides: https://github.com/udishadc/PyData-Slides


Prior Knowledge Expected

No previous knowledge expected

Udisha Dutta Chowdhury is pursuing a Master’s in Computer Systems Engineering at Northeastern University, Boston, specializing in IoT systems and Machine Learning. She is deeply passionate about machine learning and its applications in IoT systems.

Udisha holds a Bachelor’s degree in Electronics and Communication Engineering with a minor in Computer Science from PES University, Bangalore, India.

During the summer of 2024, she worked as a Data Science Intern at Schneider Electric, Andover, MA, collaborating with the AI Hub's offer management team. She developed Python-based technical products utilizing time series machine learning algorithms for IoT data, and designed frameworks to prototype and benchmark these algorithms on standardized datasets.

Previously, she was a Solution Delivery Analyst at Deloitte USI, where she focused on security analysis, incident response, and threat hunting.

Abhishek Murthy is currently a Senior Principal Data Scientist at Schneider Electric (SE) in Boston, Massachusetts USA. He is passionate about sustainability, with a focus on climate change. To that end, he develops Machine Learning (ML) algorithms on sensor data that are critical for the sustainability commitments of the Industrial Automation and Energy Management businesses of SE. He is also a lecturer at Northeastern University and teaches machine learning algorithms for the Internet of Things.

Abhishek received his PhD in Computer Science from Stony Brook University, State University of New York and MS in Computer Science from University at Buffalo. His doctoral research, which was part of a National Science Foundation Expedition in Computing, entailed developing algorithms for automatically establishing the input-to-output stability of dynamical systems.

He led the Data Science Algorithms team at WHOOP before joining SE. He also worked at Signify, formerly Philips Lighting, as a Senior Data Scientist, where he led research on IoT applications for smart buildings. Abhishek has served on several conference review committees and NSF panels. He has authored several publications and research articles with more than 195 citations, has been awarded 15 patents, and has more than 45 patent applications pending.