11-08, 11:40–12:20 (US/Eastern), Music Box
Preparing data for LLM pretraining is a highly challenging and time-consuming task. Pretraining data is usually scraped from the internet, which is full of duplicates and undesirable content such as hate, abuse, and profanity.
To produce a quality model, the collected data needs to go through a series of transformations that improve its quality, add attributes (e.g. detect language), make it safe (e.g. remove spam), and put it in a common format expected by the training module.
Data Prep Kit (https://github.com/IBM/data-prep-kit), aka DPK, is an open source project created by IBM to transform input data collected from the internet (https://commoncrawl.org/) into data ready for training.
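To make the idea of a transformation chain concrete, here is a minimal Python sketch of the stages mentioned above: deduplicate, add a language attribute, drop unsafe documents, and write a common format. It does not use the DPK API. Every name in it (dedupe, add_language, drop_unsafe, to_jsonl, BLOCKLIST, STOPWORDS) is hypothetical, the language guess is a toy stopword heuristic, and JSON Lines is only assumed as the output format for illustration.

```python
import hashlib
import json

# Hypothetical placeholder lists; a real pipeline would use trained classifiers.
BLOCKLIST = {"spamword", "badword"}
STOPWORDS = {"en": {"the", "and", "is"}, "de": {"der", "und", "ist"}}


def dedupe(docs):
    """Drop exact duplicates by hashing lowercased, whitespace-normalized text."""
    seen = set()
    for doc in docs:
        normalized = " ".join(doc["text"].lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc


def add_language(docs):
    """Annotate each document with a toy language guess from stopword overlap."""
    for doc in docs:
        words = set(doc["text"].lower().split())
        scores = {lang: len(words & stop) for lang, stop in STOPWORDS.items()}
        best = max(scores, key=scores.get)
        yield {**doc, "lang": best if scores[best] > 0 else "unknown"}


def drop_unsafe(docs):
    """Remove documents containing any blocklisted word."""
    for doc in docs:
        if not set(doc["text"].lower().split()) & BLOCKLIST:
            yield doc


def to_jsonl(docs):
    """Serialize documents to one JSON object per line (assumed common format)."""
    return "\n".join(json.dumps(doc, sort_keys=True) for doc in docs)


def run_pipeline(docs):
    """Chain the transforms; each stage consumes the previous stage's output."""
    return list(drop_unsafe(add_language(dedupe(docs))))


raw = [
    {"text": "The cat is on the mat"},
    {"text": "the  cat is on the mat"},       # duplicate after normalization
    {"text": "Der Hund ist gross und alt"},
    {"text": "buy spamword now"},             # contains a blocklisted word
]
clean = run_pipeline(raw)
print(to_jsonl(clean))
```

Each stage is a generator that takes and yields documents, so stages can be reordered or swapped independently, which is the general shape a transform-based data prep pipeline tends to follow.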
Session Outline:
LLM data prep stages: 5 mins
Introduction to DPK & how it works: 10 mins
Project Structure & transform example: 5 mins
Call for open source contributors, how to contribute, and which areas need help: 5 mins
Q&A: 5 mins
No previous knowledge expected
I work as a Senior Engineer, watsonx Data Engineering at IBM Research.
I enjoy learning new things, debugging, and solving technical challenges. I have 19 years of software development experience in technologies ranging from C/C++ and handheld device programming to big data analytics and AI.
I am passionate about playing cricket, and you will find me on the ground most weekend mornings.