11-06, 15:10–16:40 (US/Eastern), Central Park West
Preparing data for LLM pretraining is one of the most challenging and time-consuming tasks. Pretraining data is usually scraped from the internet, which is full of duplicates and undesired content such as hate, abuse, and profanity.
To produce a quality model, the collected data needs to go through a series of transformations that improve its quality, add attributes (e.g. detect language), make it safe (e.g. remove spam), and put it in the common format expected by the training module.
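The staged transformation idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not DPK's API: the function names, the `BLOCKLIST` contents, and the `word_count` attribute (standing in for a richer attribute such as detected language) are all hypothetical.

```python
import hashlib

# Hypothetical blocklist standing in for a real safety filter.
BLOCKLIST = {"spamword"}


def drop_exact_duplicates(docs):
    """Keep only the first document for each distinct text (exact match)."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def drop_blocked(docs):
    """Remove documents containing any blocklisted word."""
    return [d for d in docs if not (BLOCKLIST & set(d["text"].lower().split()))]


def add_word_count(docs):
    """Add an attribute to each document without mutating the input."""
    return [{**d, "word_count": len(d["text"].split())} for d in docs]


def run_pipeline(docs, stages):
    """Apply each stage in order; every stage maps a list of docs to a list of docs."""
    for stage in stages:
        docs = stage(docs)
    return docs


raw = [
    {"text": "Ray scales Python workloads"},
    {"text": "Ray scales Python workloads"},  # exact duplicate
    {"text": "buy cheap spamword now"},        # caught by the blocklist
]
clean = run_pipeline(raw, [drop_exact_duplicates, drop_blocked, add_word_count])
print(clean)
```

Because each stage takes and returns a list of documents, stages can be reordered, removed, or run one at a time, which is the property that makes running transforms in stages practical.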
Data Prep Kit (https://github.com/IBM/data-prep-kit), aka DPK, is an open source project created by IBM to transform input data collected from the internet (https://commoncrawl.org/) into data ready for training.
This hands-on tutorial will be a working session to understand the data preparation steps for LLMs, how Data Prep Kit (DPK) works, how to run DPK transforms in stages, how to scale data processing, and how to build a new transform.
Session Outline
Introduction to Data Prep Kit: 5 mins
How DPK works: 10 mins
Setting up Jupyter notebook environments: 5 mins
Introduction to Ray, Ray basics, and how DPK leverages Ray: 10 mins
Running DPK from data crawling to tokenization & how to scale: 25 mins
Bring Your Own Transform: 5 mins
Develop a Basic Transform from scratch: 10 mins
How you can contribute to this Open Source project and Q&A: 10 mins
No previous knowledge expected
I work as a Senior Engineer, watsonx Data Engineering at IBM Research.
I enjoy learning new things, debugging, and solving technical challenges. I have 19 years of software development experience in technologies ranging from C/C++ and handheld device programming to big data analytics and AI.
I am passionate about playing cricket, and you will find me on the ground most weekend mornings.