Pretraining Datasets#

Overview#

The PreTrainingDataModule is the base class in NeMo 2 for pretraining Large Language Models (LLMs). For best performance, NeMo 2 requires all pretraining datasets to be pre-tokenized. The class subclasses PyTorch Lightning’s LightningDataModule and builds its datasets with Megatron Core’s GPTDataset.

Data Preprocessing#

NeMo’s pretraining datasets use the same format as Megatron Core. All data must be pre-processed and tokenized before training.

First, place your training data in loose JSON format, with one JSON object containing a text sample per line. For example:
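
The two lines below are an illustrative sample; the field names other than text (src, type, id, title) are placeholders:

{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}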

The name of the text field in the JSON can be changed with the --json-keys flag in preprocess_data_for_megatron.py. The other metadata fields are optional and are not saved or used in training.

The loose JSON is then processed into a binary format for training. To convert the JSON into mmap format, use preprocess_data_for_megatron.py.

Preprocess for GPT Model#

An example script to prepare data for GPT training is:

python scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input=my-corpus.json \
    --json-keys=text \
    --tokenizer-library=megatron \
    --tokenizer-type=GPT2BPETokenizer \
    --dataset-impl=mmap \
    --output-prefix=my-gpt \
    --append-eod \
    --workers=48

The above script takes my-corpus.json and converts it into two files, named in this case my-gpt_text_document.bin and my-gpt_text_document.idx. It tokenizes with GPT2BPETokenizer, appends an EOD token to the end of each document, and uses 48 workers to parallelize preprocessing.

Important: Make sure the tokenizer is set up correctly. To use any Hugging Face tokenizer, set --tokenizer-library=huggingface and --tokenizer-type=<Hugging Face tokenizer model name>.
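
For example, a sketch of the same GPT preprocessing command with a Hugging Face tokenizer; the tokenizer name gpt2 and the output prefix my-gpt-hf are placeholders, not values taken from this page:

python scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input=my-corpus.json \
    --json-keys=text \
    --tokenizer-library=huggingface \
    --tokenizer-type=gpt2 \
    --dataset-impl=mmap \
    --output-prefix=my-gpt-hf \
    --append-eod \
    --workers=48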

Preprocess for BERT Model#

An example script to prepare data for BERT training is:

python scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input=my-corpus.json \
    --json-keys=text \
    --tokenizer-library=megatron \
    --tokenizer-type=BertWordPieceCase \
    --dataset-impl=mmap \
    --output-prefix=my-bert \
    --split-sentences \
    --workers=48

The main difference between the BERT dataset and the GPT dataset is that the BERT dataset is split into sentences with the --split-sentences flag, which also changes the output file suffix from _text_document to _text_sentence (here, my-bert_text_sentence.bin and my-bert_text_sentence.idx).

Use PreTrainingDataModule with Preprocessed Data#

  • For GPT model, use nemo/collections/llm/gpt/data/pre_training.py::PreTrainingDataModule

  • For BERT model, use nemo/collections/llm/bert/data/pre_training.py::BERTPreTrainingDataModule

  • For T5 model, use nemo/collections/llm/t5/data/pre_training.py::PreTrainingDataModule

Initialize the data module with the preprocessed data prefix, along with any additional kwargs if needed. For example, for a GPT model:

from nemo.collections.llm.gpt.data.pre_training import PreTrainingDataModule

data = PreTrainingDataModule(
    paths=["my-gpt_text_document"],
    seq_length=512,
    micro_batch_size=1,
    global_batch_size=128,
    dataset_kwargs={},
    split="95,3,2",
)

Note

  1. paths can be either a single path, a list of paths, or a dictionary. If a single path or a list of paths is given, those paths are used to generate the train, validation, and test datasets. A list of paths can take one of two forms: (1) a plain list of dataset prefixes, e.g. ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], or (2) a flattened, zipped list of weights and prefixes, e.g. ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"]. If a dictionary is provided, it is expected to have ‘train’, ‘validation’, and ‘test’ as keys, with each value being a path or a list of paths as described above. In this case, each split is generated from its own paths (see the sketch after this note).

  2. split is a string of 3 comma-separated integers denoting how much of the distribution to allocate to train, validation, and test sets, respectively.
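
The sketch below illustrates the two alternative paths forms described in the note; the dataset prefixes and blend weights are hypothetical:

from nemo.collections.llm.gpt.data.pre_training import PreTrainingDataModule

# Weighted blend: roughly 30% of samples are drawn from dataset 1 and 70% from dataset 2.
blended_data = PreTrainingDataModule(
    paths=["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"],
    seq_length=512,
    micro_batch_size=1,
    global_batch_size=128,
    split="95,3,2",
)

# Per-split dictionary: each split is generated from its own paths,
# so no split ratio is passed here.
per_split_data = PreTrainingDataModule(
    paths={
        "train": ["path/to/train_prefix"],
        "validation": ["path/to/validation_prefix"],
        "test": ["path/to/test_prefix"],
    },
    seq_length=512,
    micro_batch_size=1,
    global_batch_size=128,
)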