Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Training
To run training, first set the stages variable in conf/config.yaml to "training".
To specify the training hyperparameters, set the training variable in conf/config.yaml to retro/<model_size>, which refers to the training configuration file at conf/training/retro/<model_size>.yaml. For example, to train the 300M-parameter RETRO model, set training in conf/config.yaml to retro/300m, which points to the conf/training/retro/300m.yaml file.
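The mapping from the training value to a file path can be sketched as follows. This is only an illustration of the naming convention described above: the helper name resolve_training_config is hypothetical and is not part of the launcher, which performs this resolution internally.

```python
from pathlib import PurePosixPath


def resolve_training_config(training: str, conf_dir: str = "conf") -> str:
    """Return the YAML path that a `training` value such as 'retro/300m' refers to.

    Hypothetical helper, for illustration only.
    """
    return str(PurePosixPath(conf_dir) / "training" / f"{training}.yaml")


print(resolve_training_config("retro/300m"))  # conf/training/retro/300m.yaml
```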
The training process uses the output directory from the data preparation step. Set the path to this directory in the retro.retro_project_dir variable of the training configuration YAML file. To keep data processing consistent between the data preparation and training steps, many data hyperparameters (including data splits, sequence length, chunk length, number of training and validation samples, and tokenizer) are read from the config.json file in this directory and override any values set manually in the training configuration file.
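The override behavior can be pictured with a minimal sketch, assuming the data settings behave like a flat dictionary. The function name apply_project_overrides and the keys seq_length, chunk_length, and micro_batch_size are placeholders chosen for illustration. The actual key names in config.json and the actual merge logic in NeMo may differ.

```python
import json
import tempfile
from pathlib import Path


def apply_project_overrides(training_data_cfg: dict, project_dir: str) -> dict:
    """Return a copy of the data settings in which config.json values take precedence.

    Hypothetical helper, for illustration only.
    """
    with open(Path(project_dir) / "config.json", encoding="utf-8") as f:
        project_cfg = json.load(f)
    merged = dict(training_data_cfg)
    merged.update(project_cfg)  # values from data preparation win over manual ones
    return merged


# Usage with placeholder keys and values (not the real config.json schema).
with tempfile.TemporaryDirectory() as project_dir:
    (Path(project_dir) / "config.json").write_text(
        json.dumps({"seq_length": 2048, "chunk_length": 64}), encoding="utf-8"
    )
    manual = {"seq_length": 512, "micro_batch_size": 4}
    merged = apply_project_overrides(manual, project_dir)
    print(merged["seq_length"])  # 2048: the manually set 512 is overridden
```

The practical consequence is that editing a data hyperparameter in the training configuration file has no effect if the same setting was fixed during data preparation.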
The following section gives general instructions for running training on a Slurm cluster. For more detailed instructions on running a specific model size, see Training with Predefined Configurations.
Slurm
Set the configuration for your Slurm cluster in the conf/cluster/bcm.yaml file:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
job_name_prefix: 'nemo-megatron-'
srun_args:
- "--no-container-mount-home"
Set the job-specific training configurations in the run section of conf/training/<model_type>/<model_size>.yaml:
run:
  name: retro_300m
  results_dir: ${base_results_dir}/${.name}
  time_limit: "7-00:00:00"
  dependency: "singleton"
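The time_limit value above uses the days-hours:minutes:seconds form, so "7-00:00:00" requests seven days. The following minimal sketch converts such a string into a duration. The helper parse_time_limit is hypothetical, handles only this one form, and is not part of the launcher.

```python
from datetime import timedelta


def parse_time_limit(value: str) -> timedelta:
    """Parse a 'days-hours:minutes:seconds' string, the form used in the run section.

    Hypothetical helper, for illustration only; other time formats are not handled.
    """
    days, clock = value.split("-", 1)
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days), hours=hours, minutes=minutes, seconds=seconds)


limit = parse_time_limit("7-00:00:00")
print(limit.days)  # 7
```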
To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- training
Then, enter the following command:
python3 main.py