PyData NYC 2024

Preparing Data for LLM pretraining using open source Data Prep Kit
11-06, 15:10–16:40 (US/Eastern), Central Park West

Preparing data for LLM pretraining is among the most challenging and time-consuming tasks. Pretraining data is usually scraped from the internet, which is full of duplicates and undesired content such as hate, abuse, and profanity.

To produce a quality model, the collected data needs to go through a series of transformations that improve its quality, add attributes (e.g. detect language), make it safe (e.g. remove spam), and put it in the common format expected by the training module.
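A minimal sketch of such a chain of transformations, written in plain Python rather than with DPK's own API. The function names, the `lang` attribute, the blocklist, and the ASCII-ratio language heuristic are all assumptions made for illustration; a real pipeline would typically rely on a language-identification model, a curated safety classifier, and fuzzy deduplication.

```python
import hashlib
import json

# Hypothetical blocklist; a real pipeline would use a curated lexicon or classifier.
BLOCKLIST = {"spamword", "buy now"}


def dedup_exact(docs):
    """Drop documents whose normalized text was already seen (exact dedup only)."""
    seen = set()
    for doc in docs:
        normalized = " ".join(doc["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def add_language(docs):
    """Add a 'lang' attribute with a toy heuristic (assumption: mostly ASCII means 'en')."""
    for doc in docs:
        text = doc["text"]
        ascii_ratio = sum(ch.isascii() for ch in text) / max(len(text), 1)
        yield {**doc, "lang": "en" if ascii_ratio > 0.9 else "unknown"}


def drop_unsafe(docs):
    """Remove documents containing any blocklisted term."""
    for doc in docs:
        lowered = doc["text"].lower()
        if not any(term in lowered for term in BLOCKLIST):
            yield doc


def to_jsonl(docs):
    """Serialize to JSON Lines, used here as a stand-in for a common output format."""
    return "\n".join(json.dumps(doc, ensure_ascii=False) for doc in docs)


def run_pipeline(docs):
    """Chain the stages: dedup, then annotate, then filter."""
    return list(drop_unsafe(add_language(dedup_exact(docs))))


raw = [
    {"id": 1, "text": "Hello world"},
    {"id": 2, "text": "hello   WORLD"},           # duplicate after normalization
    {"id": 3, "text": "Buy now, limited offer"},  # matches the blocklist
    {"id": 4, "text": "こんにちは世界"},
]
cleaned = run_pipeline(raw)
print(to_jsonl(cleaned))
```

Each stage is a generator, so the stages can be reordered or swapped out independently, which mirrors the idea of applying transformations one after another.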

Data Prep Kit (https://github.com/IBM/data-prep-kit), aka DPK, is an open source project (created by IBM) that transforms input data collected from the internet (https://commoncrawl.org/) into data ready for training.

This hands-on tutorial will be a working session to understand the data preparation steps for LLMs, how Data Prep Kit (DPK) works, how to run DPK transforms in stages, how to scale data processing, and how to build a new transform.
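As a rough illustration of running transforms in stages and scaling them across shards of data, the sketch below uses only the Python standard library. The `Transform` base class, the two example transforms, and the `run_stages` and `run_parallel` helpers are hypothetical names invented here and are not DPK's actual interface; the thread pool is a stand-in for a distributed runtime such as Ray.

```python
from concurrent.futures import ThreadPoolExecutor


class Transform:
    """Hypothetical base class: takes a list of records, returns a list of records."""

    def apply(self, records):
        raise NotImplementedError


class MinLengthFilter(Transform):
    """Keep only records whose text has at least `min_chars` characters."""

    def __init__(self, min_chars):
        self.min_chars = min_chars

    def apply(self, records):
        return [r for r in records if len(r["text"]) >= self.min_chars]


class WordCount(Transform):
    """Annotate each record with a whitespace-based word count."""

    def apply(self, records):
        return [{**r, "n_words": len(r["text"].split())} for r in records]


def run_stages(records, stages):
    """Apply each transform in order; the output of one stage feeds the next."""
    for stage in stages:
        records = stage.apply(records)
    return records


def run_parallel(shards, stages, workers=4):
    """Process independent shards concurrently; results keep the shard order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda shard: run_stages(shard, stages), shards))


shards = [
    [{"text": "a short one"}, {"text": "hi"}],
    [{"text": "another document with more words"}],
]
results = run_parallel(shards, [MinLengthFilter(5), WordCount()])
print(results)
```

Because each shard is processed independently, the same stage list can be handed to more workers as the data grows, and a new transform only needs to implement `apply`.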


Session Outline

Introduction to Data Prep Kit: 5 mins

How DPK works: 10 mins

Setting up Jupyter notebook environments: 5 mins

Introduction to Ray, Ray basics, and how DPK leverages Ray: 10 mins

Running DPK from data crawling to tokenization & how to scale: 25 mins

Bring Your Own Transform: 5 mins

Develop a Basic Transform from scratch: 10 mins

How you can contribute to this open source project and Q&A: 10 mins


Prior Knowledge Expected

No previous knowledge expected

See also: data-prep-kit

I work as a Senior Engineer, watsonx Data Engineering at IBM Research.

I enjoy learning new things, debugging, and solving technical challenges. I have 19 years of software development experience in technologies ranging from C/C++ and handheld device programming to big data analytics and AI.

I am passionate about playing cricket, and you will find me on the ground most weekend mornings.
