Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.

Migrate Data Configuration from NeMo 1.0 to NeMo 2.0

Data is configured in NeMo 2.0 using the DataModule classes. The LLM collection has pre-training and fine-tuning datamodules specialized for language modeling use cases.

NeMo 1.0 (Previous Release)

In NeMo 1.0, the data configuration was controlled via the YAML configuration file. A condensed example is shown below:

data:
    train_ds:
      file_names: "my/traindata1,my/traindata2"
      global_batch_size: 4
      micro_batch_size: 2
      shuffle: True
      num_workers: 8
      memmap_workers: 2
      pin_memory: True
      max_seq_length: 2048
      min_seq_length: 1
      concat_sampling_probabilities:
        - 0.75
        - 0.25
      ...
    validation_ds:
      file_names: "/my/validdata1"
      ...
      metric:
        name: "loss"
        average: null
        num_classes: null
    test_ds:
      file_names: "/my/testdata1"
      ...
      metric:
        name: "loss"
        average: null
        num_classes: null

NeMo 2.0 (New Release)

In NeMo 2.0, data is configured via the relevant DataModule. For example, setting up the DataModule for pre-training might look like this:

from nemo.collections.llm.gpt.data import PreTrainingDataModule
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer


tokenizer = get_nmt_tokenizer("megatron", "GPT2BPETokenizer")

data = PreTrainingDataModule(
    paths={
        "train": [0.75, '/my/traindata1', 0.25, '/my/traindata2'],
        "validation": '/my/validdata1',
        "test": '/my/testdata1',
    },
    global_batch_size=4,
    micro_batch_size=2,
    shuffle=True,
    num_workers=8,
    memmap_workers=2,
    pin_memory=True,
    seq_length=2048,
    tokenizer=tokenizer,
)

For a full list of arguments supported for pre-training and fine-tuning, please refer to the PreTrainingDataModule and FineTuningDataModule documentation.

Important

If you have already processed a dataset for NeMo 1.0, you can use the same data paths in NeMo 2.0. No changes have been made to the offline data preparation steps.

Migration Steps

  1. Remove the data section from your YAML configuration file.

  2. Import the necessary modules in your Python script:

from nemo.collections.llm.gpt.data import PreTrainingDataModule
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
  3. Create an instance of PreTrainingDataModule or FineTuningDataModule, and map the arguments from your YAML file to the datamodule.

  4. If using the PreTrainingDataModule, map the dataset paths and weights from your YAML file to paths. Take the following NeMo 1.0 YAML config as an example.

    train_ds:
      file_names: "my/traindata1,my/traindata2"
      concat_sampling_probabilities:
          - 0.75
          - 0.25
      ...
    validation_ds:
      file_names: "/my/validdata1"
      ...
    test_ds:
      file_names: "/my/testdata1"
      ...
    

This NeMo 1.0 YAML config becomes a dictionary mapping each split to a list of paths in NeMo 2.0. If concat_sampling_probabilities is provided in the YAML file, the probabilities are zipped with the paths to create a flat list:

paths={
  "train": [0.75, "/my/traindata1", 0.25, "/my/traindata2"],
  "validation": ["/my/validdata1"],
  "test": ["/my/testdata1"],
}
  5. If using the FineTuningDataModule, dataset_root should point to a directory containing training.jsonl, validation.jsonl, and test.jsonl, processed to the same format as in NeMo 1.0. These file names are configurable with the train_path, validation_path, and test_path properties. Alternatively, NeMo 2.0 provides dataset-specific classes that take care of downloading, preprocessing, and splitting the dataset automatically; see the NeMo documentation for the list of supported datasets. A minimal FineTuningDataModule setup is sketched below.
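
For illustration, a minimal setup might look like the following. The dataset_root path is a placeholder, and only a subset of the available constructor arguments is shown:

from nemo.collections.llm.gpt.data import FineTuningDataModule
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer


tokenizer = get_nmt_tokenizer("megatron", "GPT2BPETokenizer")

data = FineTuningDataModule(
    # Placeholder path; the directory must contain training.jsonl,
    # validation.jsonl, and test.jsonl (or configure the *_path properties).
    dataset_root="/my/finetune/data",
    seq_length=2048,
    micro_batch_size=2,
    global_batch_size=4,
    num_workers=8,
    tokenizer=tokenizer,
)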

  6. Pass the data object to the llm.train function, as sketched below.
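
The sketch below shows this last step. The model choice (GPTConfig126M), device count, and step count are illustrative placeholders rather than recommended values, and data refers to the datamodule constructed in the previous steps:

from nemo import lightning as nl
from nemo.collections import llm


# Illustrative model; substitute the NeMo 2.0 model and config you are migrating.
model = llm.GPTModel(llm.GPTConfig126M())

trainer = nl.Trainer(
    devices=2,
    max_steps=100,
    accelerator="gpu",
    strategy=nl.MegatronStrategy(),
    plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
)

llm.train(
    model=model,
    data=data,  # the DataModule created in the previous steps
    trainer=trainer,
    tokenizer="data",  # reuse the tokenizer attached to the datamodule
)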

Some Notes on Migration

  • With the current design, users are expected to specify a single dataloading configuration that is shared across the train, validation, and test data. In other words, the arguments passed to the DataModule constructor are used for all three dataloaders. This differs from NeMo 1.0, where users could specify a different configuration per split.

  • A few of the parameters that were exposed to users in NeMo 1.0 are currently not configurable in NeMo 2.0. They are:

      • min_seq_length (defaults to 1 in 2.0)
      • label_key (defaults to output in 2.0)
      • add_eos (defaults to True)
      • add_bos (defaults to False)
      • add_sep (defaults to False)
      • truncation_field (defaults to input)
      • prompt_template (defaults to '{input} {output}')
      • drop_last (defaults to True)
      • tokens_to_generate (unsupported)
      • metric (unsupported)
      • write_predictions_to_file (unsupported)