Migrate Data Configuration from NeMo 1.0 to NeMo 2.0
Data is configured in NeMo 2.0 using the DataModule classes. The LLM collection has pre-training and fine-tuning datamodules specialized for language modeling use cases.
NeMo 1.0 (Previous Release)
In NeMo 1.0, the data configuration was controlled via the YAML configuration file. A condensed example is shown below:
data:
  train_ds:
    file_names: "my/traindata1,my/traindata2"
    global_batch_size: 4
    micro_batch_size: 2
    shuffle: True
    num_workers: 8
    memmap_workers: 2
    pin_memory: True
    max_seq_length: 2048
    min_seq_length: 1
    concat_sampling_probabilities:
      - 0.75
      - 0.25
    ...
  validation_ds:
    file_names: "/my/validdata1"
    ...
    metric:
      name: "loss"
      average: null
      num_classes: null
  test_ds:
    file_names: "/my/testdata1"
    ...
    metric:
      name: "loss"
      average: null
      num_classes: null
NeMo 2.0 (New Release)
In 2.0, data is configured via the relevant DataModule. For example, setting up the DataModule for pre-training might look like:
from nemo.collections.llm.gpt.data import PreTrainingDataModule
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer("megatron", "GPT2BPETokenizer")
data = PreTrainingDataModule(
    # The "train" entry interleaves sampling weights with dataset paths:
    # [weight1, path1, weight2, path2, ...]
    paths={
        "train": [0.75, '/my/traindata1', 0.25, '/my/traindata2'],
        "validation": '/my/validdata1',
        "test": '/my/testdata1',
    },
    global_batch_size=4,
    micro_batch_size=2,
    num_workers=8,
    pin_memory=True,
    seq_length=2048,
    tokenizer=tokenizer,
)
For a full list of arguments supported for pre-training and fine-tuning, please refer to the PreTrainingDataModule and FineTuningDataModule documentation.
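Fine-tuning is configured analogously through the FineTuningDataModule. The following is a minimal sketch, not a tested recipe: it assumes the batching arguments mirror the pre-training example above, and the directory /my/finetune_data is a hypothetical path that contains training.jsonl, validation.jsonl, and test.jsonl as described in the migration steps below:

from nemo.collections.llm.gpt.data import FineTuningDataModule

# dataset_root is assumed to contain training.jsonl, validation.jsonl,
# and test.jsonl, processed to the same format as in NeMo 1.0.
data = FineTuningDataModule(
    dataset_root="/my/finetune_data",  # hypothetical path
    seq_length=2048,
    micro_batch_size=2,
    global_batch_size=4,
    num_workers=8,
    tokenizer=tokenizer,  # e.g. from get_nmt_tokenizer as above
)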
Important
If you have already processed a dataset for NeMo 1.0, you can use the same data paths in NeMo 2.0. No changes have been made to the offline data preparation steps.
Migration Steps
1. Remove the data section from your YAML configuration file.

2. Import the necessary modules in your Python script:

   from nemo.collections.llm.gpt.data import PreTrainingDataModule
   from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
3. Create an instance of PreTrainingDataModule or FineTuningDataModule, and map the arguments from your YAML file to the datamodule.

   If using the PreTrainingDataModule, map the dataset paths and weights from your YAML file to paths. Take the following NeMo 1.0 YAML config as an example:

   train_ds:
     file_names: "my/traindata1,my/traindata2"
     concat_sampling_probabilities:
       - 0.75
       - 0.25
     ...
   validation_ds:
     file_names: "/my/validdata1"
     ...
   test_ds:
     file_names: "/my/testdata1"
     ...

   This NeMo 1.0 YAML config becomes a dictionary mapping each split to a list of paths in NeMo 2.0. If concat_sampling_probabilities is provided in the YAML file, the probabilities are zipped with the paths to create a flat list (see the sketch after these steps):

   paths={
       "train": [0.75, "my/traindata1", 0.25, "my/traindata2"],
       "validation": ["/my/validdata1"],
       "test": ["/my/testdata1"],
   }
   If using the FineTuningDataModule, dataset_root should point to a directory containing training.jsonl, validation.jsonl, and test.jsonl, processed to the same format as in NeMo 1.0. These file names are configurable with the train_path, validation_path, and test_path properties. Alternatively, NeMo 2.0 provides dataset-specific classes which take care of downloading, preprocessing, and splitting the dataset automatically. Supported datasets can be found here.
4. Pass the data object to the llm.train function (a sketch follows these steps).
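The zipping described in step 3 can be reproduced with plain Python. Below is a minimal sketch; the variable names are illustrative, not part of the NeMo API:

# Values taken from the NeMo 1.0 YAML example above.
train_files = ["my/traindata1", "my/traindata2"]  # file_names
sampling_probs = [0.75, 0.25]                     # concat_sampling_probabilities

# Interleave each probability with its path into one flat list.
train_paths = []
for prob, path in zip(sampling_probs, train_files):
    train_paths.extend([prob, path])

paths = {
    "train": train_paths,  # [0.75, "my/traindata1", 0.25, "my/traindata2"]
    "validation": ["/my/validdata1"],
    "test": ["/my/testdata1"],
}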
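For step 4, passing the datamodule to llm.train might look like the sketch below. The model and trainer settings here are illustrative assumptions rather than a tested recipe; refer to the NeMo 2.0 quickstart for complete examples:

from nemo import lightning as nl
from nemo.collections import llm

# Illustrative toy model config; use the config that matches your model.
model = llm.GPTModel(
    llm.GPTConfig(
        num_layers=2,
        hidden_size=256,
        ffn_hidden_size=1024,
        num_attention_heads=4,
        seq_length=2048,
    ),
    tokenizer=data.tokenizer,
)

# Minimal trainer; real runs also configure parallelism and precision.
trainer = nl.Trainer(
    devices=1,
    max_steps=100,
    accelerator="gpu",
    strategy=nl.MegatronStrategy(),
)

# `data` is the DataModule instance created earlier.
llm.train(model=model, data=data, trainer=trainer)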
Some Notes on Migration
With the current design, users are expected to specify a single dataloading configuration to be shared across train, validation, and test data. In other words, the arguments passed to the DataModule constructor are used for all three dataloaders. This differs from NeMo 1.0, where users could specify a different configuration per split.

A few of the parameters that were exposed to users in NeMo 1.0 are currently not configurable in NeMo 2.0. They are:
- min_seq_length (defaults to 1 in 2.0)
- label_key (defaults to output in 2.0)
- add_eos (defaults to True)
- add_bos (defaults to False)
- add_sep (defaults to False)
- truncation_field (defaults to input)
- prompt_template (defaults to '{input} {output}')
- drop_last (defaults to True)
- tokens_to_generate (unsupported)
- metric (unsupported)
- write_predictions_to_file (unsupported)