PyData NYC 2024

Holistic root cause analysis of software breakages through structural causal modeling
11-07, 10:55–11:35 (US/Eastern), Central Park East

The ability to quickly identify and resolve breakages among interconnected microservices is critical for any tech organization running production software. Unfortunately, in most organizations, identifying the root cause of a breakage can take engineers hours of manually sifting through logs and dashboards. In this talk, we’ll describe a fast, automated, and holistic approach to root cause analysis via an ensemble of structural causal models. This talk should be relevant to anyone interested in causal modeling, observability, or reliability engineering, and to anyone who wants to troubleshoot production software issues faster.


Slides for this talk can be found here: https://docs.google.com/presentation/d/1j5hoaayuPku2HDejuQaXnrq7sZA10IUZFp8jWPFLRoM/edit?usp=sharing

Interest in the field of observability has been exploding over the last few years, which is understandable because quickly identifying and resolving software problems is critical for tech organizations everywhere. Many organizations log the “golden signal” data of microservice health (e.g. latency, status codes, and request counts for APIs) and employ service monitoring dashboard tools; however, the task of identifying the root cause of any software breakage remains challenging and highly manual. Medium or large tech organizations will often have hundreds or thousands of interconnected microservices, and a breakage somewhere in this tech stack can cause cascading failures that are difficult to quickly explain.

In this talk we will give a brief overview of this common observability problem, describe a holistic, Python-based causal modeling solution we have developed, and demonstrate its efficacy on a simulated set of interconnected APIs. This talk will not be particularly math-heavy and should provide a friendly overview of the problem and solution. The only prior knowledge recommended for attendees is modest experience with machine learning modeling in the typical open-source data stack, a modest understanding of graphs, and some experience working with APIs. Attendees should walk away with a better understanding of observability data, a gentle introduction to structural causal modeling, and an understanding of a novel causal approach to root cause analysis.
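To make the idea of causal root cause analysis concrete, here is a minimal sketch of one common way it can be framed. It is not the speakers' implementation. It assumes a hypothetical three-service call chain (frontend calls orders, which calls database), simulated latencies, and linear causal mechanisms fitted with least squares. All service names, coefficients, and the `db_shift` parameter are made up for illustration. Each service's latency is modeled as a function of the service it depends on, and an incident is attributed to the service whose own noise term is most unusual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical call chain: frontend -> orders -> database.
# A caller's latency depends on its callee, so causal edges run
# database -> orders -> frontend.
PARENTS = {"database": [], "orders": ["database"], "frontend": ["orders"]}
ORDER = ["database", "orders", "frontend"]  # topological order of the graph


def simulate(n, db_shift=0.0):
    """Simulate latencies (ms); db_shift injects a fault at the database."""
    db = 20 + rng.normal(0, 2, n) + db_shift
    orders = 10 + db + rng.normal(0, 2, n)
    frontend = 5 + orders + rng.normal(0, 2, n)
    return {"database": db, "orders": orders, "frontend": frontend}


def fit(data):
    """Fit one linear mechanism per node: node = beta . [1, parents] + noise."""
    model = {}
    for node in ORDER:
        y = data[node]
        X = np.column_stack([np.ones_like(y)] + [data[p] for p in PARENTS[node]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        model[node] = (beta, (y - X @ beta).std())
    return model


def noise_scores(model, sample):
    """Score each node by how many standard deviations its noise term is off."""
    scores = {}
    for node in ORDER:
        beta, sigma = model[node]
        x = np.array([1.0] + [sample[p] for p in PARENTS[node]])
        scores[node] = abs(sample[node] - x @ beta) / sigma
    return scores


model = fit(simulate(5000))                       # learn normal behavior
incident = simulate(1, db_shift=30.0)             # one anomalous observation
sample = {k: float(v[0]) for k, v in incident.items()}
scores = noise_scores(model, sample)
root_cause = max(scores, key=scores.get)
print(root_cause, scores)
```

In this sketch all three latencies are elevated during the incident, but only the database has a large noise score: the downstream services are slow for a reason their parents already explain.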

Talk Outline:

Describing the problem (5 minutes)

Introducing observability

Describing the “golden signals” of APIs (e.g. status codes, latency, and traffic)

Why root cause analysis (RCA) is hard

Introduction to Structural Causal Models (SCMs) (5 minutes)

The mechanics of SCMs

Describing our approach to anomaly attribution / RCA

Describing our simulation of many interdependent APIs (5 minutes)

Sharing our dependency graph

Sharing assumptions

Showing an animated GIF of APIs working together

Sharing the holistic approach (5 minutes)

Combining RCA results from multiple signals into a holistic approach for fault localization

Sharing results of the approach on a simulated breakage

Concluding remarks (5 minutes)

Limitations

Future directions

Q&A (5 minutes)


Prior Knowledge Expected

No previous knowledge expected

Roni is a former academic epidemiology researcher who spent a decade applying causal modeling to the population-level effects of harmful environmental exposures. Since leaving the academic world, he's been loving his second life in the tech industry as a data scientist, and is currently Director of Data Science at Capital One. He loves contributing to the open-source community, mentoring junior data folks, and explaining the magic of data analysis and modeling to non-technical audiences.

Ethan is a Principal Machine Learning Engineer at Capital One in the observability space. With a Master's in Data Science from Northwestern University, he has over five years of experience building data pipelines and ML solutions for automated anomaly detection and customer impact analysis. His main areas of expertise are deep learning, time series forecasting, and natural language processing.

Aditya is a Principal Data Scientist at Capital One, building an automated observability platform that identifies application failures within the system. Throughout his career in the industry, he has gained extensive experience applying causal inference techniques to a range of use cases, including a personalized promo allocation system, a surge pricing platform, and root cause identification. Aditya obtained his M.S. in Computer Science from Columbia University with a specialization in Machine Learning. He actively contributes back to the data science community by mentoring students interested in entering the field. He is also involved in pro bono work, providing technical consultancy to non-profit organizations.