Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.

Training

To run training, first set the stages variable in conf/config.yaml to “training”. To specify the training hyperparameters, set the training variable in conf/config.yaml to retro/<model_size>, which refers to the training configuration file at conf/training/retro/<model_size>.yaml. For example, to train the 300m-parameter RETRO model, set training in the conf/config.yaml file to retro/300m, which points to the conf/training/retro/300m.yaml file.
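The mapping from the training value to a configuration file can be sketched in a few lines of Python. This is only an illustration of the path convention described above; the helper name training_config_path is hypothetical and is not part of the launcher.

```python
from pathlib import PurePosixPath


def training_config_path(training_value: str, conf_root: str = "conf") -> str:
    """Return the YAML file selected by a `training` value such as "retro/300m".

    Hypothetical helper: it only mirrors the convention
    conf/training/<training_value>.yaml described in the text.
    """
    return str(PurePosixPath(conf_root) / "training" / f"{training_value}.yaml")


print(training_config_path("retro/300m"))  # conf/training/retro/300m.yaml
```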

The training process uses the output directory from the data preparation step. Set the path to this directory in the retro.retro_project_dir variable of the training configuration yaml file. To keep data processing consistent between the data preparation step and the training step, many of the data hyperparameters are retrieved from the config.json file in this directory (including data splits, sequence length, chunk length, number of training and validation samples, tokenizer, etc.) and override any values set manually in the training configuration file.
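Because the values in config.json take precedence, it can help to inspect that file before launching a job. The following is a minimal sketch, assuming config.json is a plain JSON object at the top level of the project directory; the function names are hypothetical, and the keys in the file are whatever the data preparation step wrote, so none are hard-coded here.

```python
import json
from pathlib import Path


def load_retro_data_config(retro_project_dir: str) -> dict:
    """Load the config.json written by the data preparation step.

    Assumption: the file is a JSON object located directly in
    the directory given by retro.retro_project_dir.
    """
    config_file = Path(retro_project_dir) / "config.json"
    with config_file.open(encoding="utf-8") as f:
        return json.load(f)


def print_data_config(retro_project_dir: str) -> None:
    """Print every stored setting, sorted by key, for a quick review."""
    for key, value in sorted(load_retro_data_config(retro_project_dir).items()):
        print(f"{key}: {value}")
```

Calling print_data_config with the same path used for retro.retro_project_dir lists the settings that will be used during training, regardless of what the training configuration file says.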

The following section describes the general instructions for running training on a Slurm cluster. For more detailed instructions on running a specific model size, see Training with Predefined Configurations.

Slurm

Set the configuration for your Slurm cluster in conf/cluster/bcm.yaml:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
job_name_prefix: 'nemo-megatron-'
srun_args:
- "--no-container-mount-home"

Set the job-specific training configurations in the run section of conf/training/<model_type>/<model_size>.yaml:

run:
    name: retro_300m
    results_dir: ${base_results_dir}/${.name}
    time_limit: "7-00:00:00"
    dependency: "singleton"
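The time_limit value above uses Slurm's days-hours:minutes:seconds notation, so "7-00:00:00" requests seven days. As a rough sketch, the hypothetical helper below converts that one form into a Python timedelta; Slurm accepts other time formats as well, which this example does not handle.

```python
from datetime import timedelta


def parse_slurm_time_limit(value: str) -> timedelta:
    """Parse a "D-HH:MM:SS" time limit into a timedelta.

    Hypothetical helper covering only the days-hours:minutes:seconds form.
    """
    days, sep, clock = value.partition("-")
    parts = clock.split(":")
    if not sep or len(parts) != 3:
        raise ValueError(f"expected D-HH:MM:SS, got {value!r}")
    hours, minutes, seconds = (int(part) for part in parts)
    return timedelta(days=int(days), hours=hours, minutes=minutes, seconds=seconds)


print(parse_slurm_time_limit("7-00:00:00"))  # 7 days, 0:00:00
```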

To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - training

Then, enter the following command:

python3 main.py