Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Training
Specify the configuration used for the training pipeline by setting the training key in conf/config.yaml to the desired training configuration file. training must also be included in stages to run the training pipeline. The training configuration must be set to t5/<model_size>; for example, t5/220m refers to the configuration file conf/training/t5/220m.yaml.
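For reference, here is a minimal sketch of the relevant entries in conf/config.yaml, assuming the Hydra-style defaults list used by the launcher (the t5/220m selection is only an example):

defaults:
  - training: t5/220m  # selects conf/training/t5/220m.yaml

stages:
  - training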
To adjust the hyperparameters of a training run, edit its configuration file. All supported model types and sizes are stored in the directory conf/training.
Slurm
Set the configuration for your Slurm cluster in conf/cluster/bcm.yaml:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
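On a typical Slurm cluster you only need to fill in the scheduler-specific fields for your allocation; for example (the partition and account names below are placeholders):

partition: batch
account: my_account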
Set the job-specific training configurations in the run section of conf/training/<model_type>/<model_size>.yaml:
run:
  name: t5_220m
  results_dir: ${base_results_dir}/${.name}
  time_limit: "7-00:00:00"
  dependency: "singleton"
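For example, for a short trial run you might rename the run and shorten the Slurm time limit (the values below are illustrative, not recommendations):

run:
  name: t5_220m_test
  results_dir: ${base_results_dir}/${.name}
  time_limit: "0-02:00:00"
  dependency: "singleton"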
To run only the training pipeline, and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
  - training
Then enter:
python3 main.py
Base Command Platform
Select the cluster-related configuration according to the NGC documentation. Then enter python3 main.py to launch the job, overriding any training configuration values that need to be changed.
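As a rough sketch, the cluster-related entries in conf/config.yaml for Base Command Platform might look like the following; the key names and paths shown are assumptions that depend on your launcher version and workspace layout, so follow the NGC documentation for the exact values:

cluster_type: bcp  # assumed key: bcm for Slurm, bcp for Base Command Platform
launcher_scripts_path: /opt/NeMo-Framework-Launcher/launcher_scripts  # assumed path inside the container
data_dir: /mount/workspace/data           # assumed workspace mount for datasets
base_results_dir: /mount/workspace/results  # assumed workspace mount for results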