Pretraining Datasets#
Overview#
The PreTrainingDataModule is a base class in NeMo 2 for pretraining Large Language Models (LLMs). NeMo 2 requires all pretraining datasets to be pre-tokenized for best performance. This class integrates with PyTorch Lightning's LightningDataModule and Megatron's GPTDataset.
Data Preprocessing#
NeMo’s pretraining datasets use the same format as Megatron Core. All data must be pre-processed and tokenized before training.
First, place your training data in a loose JSON format, with one JSON object per line, each containing a text sample.
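For example, a two-line corpus might look like the following (only the text field is required; the other field names shown here are illustrative metadata):

{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}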
The name of the text field in the JSON can be changed with the --json-keys flag of preprocess_data_for_megatron.py. The other metadata fields are optional and are neither saved nor used in training.

The loose JSON is then processed into a binary format for training. To convert the JSON into mmap format, use preprocess_data_for_megatron.py.
Preprocess for GPT Model#
An example script to prepare data for GPT training is:
python scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input=my-corpus.json \
--json-keys=text \
--tokenizer-library=megatron \
--tokenizer-type=GPT2BPETokenizer \
--dataset-impl=mmap \
--output-prefix=my-gpt \
--append-eod \
--workers=48
The above script takes my-corpus.json and converts it into two files named, in this case, my-gpt_text_sentence.bin and my-gpt_text_sentence.idx. It uses GPT2BPETokenizer as the tokenizer, appends an EOD token to the end of each document (--append-eod), and uses 48 worker processes to speed up preprocessing.
Important: Make sure the tokenizer is set up correctly. To use any Hugging Face tokenizer, set --tokenizer-library=huggingface and --tokenizer-type=<huggingface-tokenizer-model>.
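For illustration, a hypothetical invocation with a Hugging Face tokenizer might look like the following (the tokenizer model name and output prefix are placeholders; the remaining flags mirror the GPT example above):

python scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input=my-corpus.json \
--json-keys=text \
--tokenizer-library=huggingface \
--tokenizer-type=<huggingface-tokenizer-model> \
--dataset-impl=mmap \
--output-prefix=my-gpt-hf \
--append-eod \
--workers=48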
Preprocess for BERT Model#
An example script to prepare data for BERT training is:
python scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input=my-corpus.json \
--json-keys=text \
--tokenizer-library=megatron \
--tokenizer-type=BertWordPieceCase \
--dataset-impl=mmap \
--output-prefix=my-bert \
--split-sentences \
--workers=48
The main difference between BERT and GPT datasets is that BERT datasets are split by sentence using the --split-sentences flag.
Use PreTrainingDataModule with Preprocessed Data#
For the GPT model, use nemo/collections/llm/gpt/data/pre_training.py::PreTrainingDataModule
For the BERT model, use nemo/collections/llm/bert/data/pre_training.py::BERTPreTrainingDataModule
For the T5 model, use nemo/collections/llm/t5/data/pre_training.py::PreTrainingDataModule
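Based on the module paths above, the classes can be imported along these lines (a sketch; the T5 alias is only to disambiguate the two identically named classes):

from nemo.collections.llm.gpt.data.pre_training import PreTrainingDataModule
from nemo.collections.llm.bert.data.pre_training import BERTPreTrainingDataModule
from nemo.collections.llm.t5.data.pre_training import PreTrainingDataModule as T5PreTrainingDataModule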
Initialize your data module with the preprocessed data, as well as any additional kwargs, if needed. For example, for the GPT model:
PreTrainingDataModule(
    paths=["my-gpt_text_sentence"],
    seq_length=512,
    micro_batch_size=1,
    global_batch_size=128,
    dataset_kwargs={},
    split="95,3,2",
)
Note

paths can be either a single path, a list of paths, or a dictionary. If a single path or a list of paths is given, those paths are used to generate the train, validation, and test datasets. A list of paths can take one of two forms: (1) a plain list of dataset prefixes, e.g. ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], or (2) a flattened, zipped list of weights and paths, e.g. ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"]. If a dictionary is provided, it is expected to have 'train', 'validation', and 'test' as keys, and each value is either a path or a list of paths as described above. In this case, each split is generated from the paths given for its key.

split is a string of three comma-separated integers denoting how much of the distribution to allocate to the train, validation, and test sets, respectively.
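As a sketch of the options described in the note (the dataset prefixes and weights below are placeholders, and the remaining arguments simply mirror the GPT example above):

# (1) Flattened, zipped list of weights and paths: 30% from dataset 1, 70% from dataset 2.
PreTrainingDataModule(
    paths=["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"],
    seq_length=512,
    micro_batch_size=1,
    global_batch_size=128,
    split="95,3,2",
)

# (2) Dictionary form: each split is generated from its own paths.
PreTrainingDataModule(
    paths={
        "train": ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"],
        "validation": ["path/to/validation_prefix"],
        "test": ["path/to/test_prefix"],
    },
    seq_length=512,
    micro_batch_size=1,
    global_batch_size=128,
)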