Step #2: Preprocess and Tokenize

Now that you have been introduced to synthetic data generation and have had a brief preview of the dataset, the next step is preprocessing and tokenization. You will learn about the GPT-2 BPE optimizations over traditional implementations, perform ETL, and tokenize dataframe columns.
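
For orientation, here is a minimal sketch of what byte pair encoding (BPE) does to column values. It is a toy, pure-Python illustration and not the tokenizer used in the notebook: the function names (learn_bpe_merges, apply_merge, encode), the sample rows, and the column names are all hypothetical, and a list of dicts stands in for a dataframe.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings (toy BPE training)."""
    # Start with every word split into single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(apply_merge(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize one string by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical rows standing in for a dataframe (column names are made up).
rows = [
    {"date": "2023-01-10", "amount": "120.50"},
    {"date": "2023-01-11", "amount": "120.75"},
]

# Learn merges over all cell values, then tokenize each column of each row.
all_values = [value for row in rows for value in row.values()]
merges = learn_bpe_merges(all_values, num_merges=8)
column_tokens = [
    {column: encode(value, merges) for column, value in row.items()}
    for row in rows
]
print(column_tokens)
```

Joining the tokens of any cell gives back the original string, so the encoding is lossless. Production tokenizers work on bytes, use a pretrained merge table, and cache results, which is where optimizations like those covered in the notebook apply.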

In your JupyterLab tab, open the 1_Megatron_Preprocessing notebook and run it.

Once you are done running through the notebook, proceed with Step #3.

© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.