PyData NYC 2024

Reproducible work environments for data scientists using Nix
11-07, 11:40–12:20 (US/Eastern), Music Box

Data scientists need to communicate results and findings to stakeholders and cross-functional teams, but it is just as crucial that those results can be reproduced within and across teams. Unfortunately, differences in operating systems, package versions, and package managers make it hard to run the same module seamlessly on different machines. In this talk, we will explore how to create reproducible work environments using Nix and why it is a useful tool for any data scientist or ML engineer.


Audience

This talk is for Data Scientists and Machine Learning Engineers at any level. Basic knowledge of Docker containers is helpful but optional.

Take Away

Attendees will learn why reproducibility is important and how to use Nix’s features in daily work.

Details

In recent years, containerization using tools like Docker has become a cornerstone for deploying applications efficiently. One reason for its popularity is the ability to create a consistent and deterministic environment in production.
However, reproducibility is just as valuable during development, where it is often harder to achieve because of the diverse range of operating systems, package versions, dependency managers, and Python versions. What runs smoothly on one machine may fail on another, leading to inconsistencies that hinder collaboration and slow down development.
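
To make this drift concrete, here is a minimal Python sketch that records the interpreter, operating system, and installed package versions on one machine and reduces them to a single hash that two teammates can compare. It is an illustration only and is not taken from the talk or its repository; the function names (environment_snapshot, fingerprint, diff_packages) are hypothetical, and it uses only the standard library.

```python
import hashlib
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Collect interpreter, OS, and installed-package details for this machine."""
    packages = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.version
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name and version:
            packages[name.lower()] = version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


def fingerprint(snapshot):
    """Hash a snapshot so two machines can compare one short string."""
    canonical = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def diff_packages(a, b):
    """Return {name: (version_in_a, version_in_b)} for every mismatch."""
    names = set(a["packages"]) | set(b["packages"])
    return {
        n: (a["packages"].get(n), b["packages"].get(n))
        for n in sorted(names)
        if a["packages"].get(n) != b["packages"].get(n)
    }


if __name__ == "__main__":
    snap = environment_snapshot()
    print(fingerprint(snap))
```

If two machines print different hashes, saving each snapshot as JSON and passing both to diff_packages shows which packages differ. A check like this only detects drift after the fact; it does not prevent it, which is the gap a declarative tool such as Nix aims to close.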

We will begin the talk by exploring the fundamental challenges of maintaining consistency in development setups. While Docker has been widely adopted for its ability to encapsulate production environments, it has limitations, especially in development. We will briefly discuss these limitations to set the stage for why a tool like Nix is needed.
From there, we will dive into the Nix ecosystem, introducing key components such as Nixpkgs, NixOS, and the Nix language. We will see how these elements work together to create environments that are consistent across machines, reproducible, and highly customizable.

After that, we will put theory into practice by building a sample data science project. Step by step, we will make the development environment reproducible using Nix, demonstrating how this tool can streamline collaboration and ensure that your code runs identically, regardless of the underlying system. Finally, we will discuss further use cases and extensions, such as CI/CD integration, handling different Python versions, and collaboration.

Outline

  1. Introduction and motivation [1 min]
  2. Why do we need reproducible behavior? [3 min]
  3. Why is it hard to have a deterministic environment? [2 min]
  4. Docker containers [3 min]
    • Why is it widely used?
    • Where is it useful?
    • Where does it fall short for dev environments?
  5. Nix concepts [5 min]
    • What is it?
    • Benefits
    • Ecosystem
    • Isolated environments
  6. Sample Data Science project [2 min]
  7. Show nondeterministic behavior of the project [3 min]
  8. Introduce Nix in the project [4 min]
    • Install Nix
    • Write a Nix file using the Nix language
  9. Spin up the Nix environment [1 min]
  10. Run project and verify reproducibility [2 min]
  11. Further use cases and extensions [4 min]
    • CI/CD integration
    • Handling different Python versions
    • Collaboration

https://github.com/ab93/nix-datascience/tree/main


Prior Knowledge Expected

No previous knowledge expected

Avik is a seasoned data scientist who has worked in multiple domains of machine learning. He loves coding in Python and writing elegant, scalable code.