PyData NYC 2024

Santosh Borse

I work as a Senior Engineer, watsonx Data Engineering at IBM Research.

I enjoy learning new things, debugging, and solving technical challenges. I have 19 years of software development experience in technologies ranging from C/C++ and handheld device programming to big data analytics and AI.

I am passionate about playing cricket, and you will find me on the ground most weekend mornings.


Sessions

11-06
15:10
90min
Preparing Data for LLM pretraining using open source Data Prep Kit
Santosh Borse

Preparing data for LLM pretraining is one of the most challenging and time-consuming tasks. Pretraining data is usually scraped from the internet, which is full of duplicates and undesired content such as hate, abuse, and profanity.

To produce a quality model, the collected data needs to go through a series of transformations that improve its quality, add attributes (e.g. detect language), make it safe (e.g. remove spam), and put it in the common format expected by the training module.
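The series of transformations described above can be pictured as a chain of small stages. The following is only a minimal sketch in plain Python and does not use the Data Prep Kit API: the stage names, the `BLOCKLIST` terms, and the `ascii_ratio` attribute (a crude stand-in for real language detection) are all assumptions made for illustration.

```python
import hashlib
import re

# Hypothetical blocklist, for illustration only; real pipelines rely on
# curated word lists or trained classifiers.
BLOCKLIST = {"spamword", "badword"}


def normalize(text):
    """Collapse whitespace and lowercase so near-identical copies hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(docs):
    """Improve quality: drop exact duplicates (after normalization) via hashing."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def annotate(docs):
    """Add attributes: word count and ASCII ratio (a stand-in for language ID)."""
    for doc in docs:
        text = doc["text"]
        ascii_ratio = sum(ch.isascii() for ch in text) / max(len(text), 1)
        yield {**doc, "num_words": len(text.split()), "ascii_ratio": round(ascii_ratio, 2)}


def filter_unsafe(docs):
    """Make it safe: drop documents containing any blocklisted word."""
    for doc in docs:
        words = set(re.findall(r"[a-z']+", doc["text"].lower()))
        if not words & BLOCKLIST:
            yield doc


def to_training_format(docs):
    """Common format: assign sequential ids and tidy whitespace in the text."""
    for i, doc in enumerate(docs):
        yield {
            "id": i,
            "text": re.sub(r"\s+", " ", doc["text"]).strip(),
            "num_words": doc["num_words"],
            "ascii_ratio": doc["ascii_ratio"],
        }


def run_pipeline(raw_docs):
    """Run the stages in order: dedupe, annotate, filter, format."""
    return list(to_training_format(filter_unsafe(annotate(dedupe(raw_docs)))))
```

Because each stage is a generator that takes and yields documents, stages can be reordered or run separately, which loosely mirrors the idea of running transforms in stages.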

Data Prep Kit (https://github.com/IBM/data-prep-kit), aka DPK, is an open source project (created by IBM) that transforms input data collected from the internet (https://commoncrawl.org/) into data ready for training.

This hands-on tutorial will be a working session covering the data preparation steps for LLMs, how Data Prep Kit (DPK) works, how to run DPK transforms in stages, how to scale data processing, and how to build a new transform.

Central Park West
11-08
11:40
40min
Preparing data for LLM training with Data Prep Kit
Santosh Borse

Preparing data for LLM pretraining is one of the most challenging and time-consuming tasks. Pretraining data is usually scraped from the internet, which is full of duplicates and undesired content such as hate, abuse, and profanity.

To produce a quality model, the collected data needs to go through a series of transformations that improve its quality, add attributes (e.g. detect language), make it safe (e.g. remove spam), and put it in the common format expected by the training module.

Data Prep Kit (https://github.com/IBM/data-prep-kit), aka DPK, is an open source project (created by IBM) that transforms input data collected from the internet (https://commoncrawl.org/) into data ready for training.

Music Box