Preparing data for LLM training with Data Prep Kit PyData NYC 2024

Preparing data for LLM training with Data Prep Kit

11-08, 11:40–12:20 (US/Eastern), Music Box

Preparing data for LLM pretraining is a challenging and time-consuming task. Pretraining data is usually scraped from the internet, which is full of duplicates and undesirable content such as hate, abuse, and profanity.

To produce a quality model, the collected data needs to go through a series of transformations that improve its quality, add attributes (e.g. detect language), make it safe (e.g. remove spam), and put it in the common format expected by the training module.
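
As a rough illustration of such a series of transformations, here is a minimal sketch in plain Python. It does not use the DPK API: the function names, the `BLOCKLIST` terms, and the `word_count` attribute are all assumptions made for the example, and the attribute step stands in for richer annotations such as language detection.

```python
import hashlib
import re

# Hypothetical blocklist, for illustration only.
BLOCKLIST = {"spamword", "badword"}


def normalize(text):
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(docs):
    """Exact deduplication: keep the first document for each normalized-text hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def annotate(doc):
    """Add attributes: a word count and a flag for blocklisted terms."""
    words = re.findall(r"[a-z']+", doc["text"].lower())
    return {
        **doc,
        "word_count": len(words),
        "flagged": any(word in BLOCKLIST for word in words),
    }


def prepare(docs):
    """Run the steps in order: deduplicate, annotate, then drop flagged documents."""
    annotated = [annotate(doc) for doc in dedupe(docs)]
    return [doc for doc in annotated if not doc["flagged"]]


raw = [
    {"id": 1, "text": "Hello   world"},
    {"id": 2, "text": "hello world"},
    {"id": 3, "text": "Buy now spamword"},
    {"id": 4, "text": "Data preparation matters"},
]
clean = prepare(raw)
print([doc["id"] for doc in clean])
```

Each step takes and returns a list of dictionaries with the same shape, so steps can be reordered or extended without touching the others. A real pipeline would also need near-duplicate detection and would process data in a distributed fashion rather than in memory.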

Data Prep Kit (https://github.com/IBM/data-prep-kit), also known as DPK, is an open source project created by IBM that transforms input data collected from the internet (https://commoncrawl.org/) into data ready for training.

Session Outline:

LLM data prep stages: 5 min
Introduction to DPK and how it works: 10 min
Project structure and transform example: 5 min
Call for open source contributors, and which areas need them: 5 min
Q&A: 5 min

Prior Knowledge Expected –

No previous knowledge expected

See also: Data Prep Kits

Santosh Borse

I work as a Senior Engineer, watsonx Data Engineering at IBM Research.

I enjoy learning new things, debugging, and solving technical challenges. I have 19 years of software development experience in technologies ranging from C/C++ and handheld device programming to big data analytics and AI.

I am passionate about playing cricket, and you will find me on the ground most weekend mornings.

This speaker also appears in:

Preparing Data for LLM pretraining using open source Data Prep Kit
