Training

Specify the configuration for the training pipeline in conf/config.yaml by setting the training key to the path of the configuration file to use. You must also include training in stages to run the training pipeline.

Set the training configuration to mt5/<model_size>. For example, to train a 390M model, set it to mt5/390m, which selects the file conf/training/mt5/390m.yaml.

You can change the configuration to adjust the hyperparameters of the training runs. All supported model types and sizes are stored in the directory conf/training.
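The selection described above can be sketched as the relevant entries of conf/config.yaml (a minimal sketch; the exact defaults-list layout of your conf/config.yaml may differ):

```yaml
# conf/config.yaml (excerpt) -- select the training config and enable the stage
defaults:
  - training: mt5/390m   # resolves to conf/training/mt5/390m.yaml

stages:
  - training
```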

Slurm

Set configuration for your Slurm cluster in conf/cluster/bcm.yaml:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Set the training job-specific configurations in the run section of conf/training/<model_type>/<model_size>.yaml:

run:
    name: mt5_390m
    results_dir: ${base_results_dir}/${.name}
    time_limit: "7-00:00:00"
    dependency: "singleton"

To run only the training pipeline, and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - training

Then enter:

python3 main.py

Base Command Platform

Select the cluster-related configuration according to the NGC Base Command Platform documentation. Then enter python3 main.py to launch the job, overriding on the command line any training configuration values that need to be updated.
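Command-line overrides use dotted keys matching the configuration hierarchy. A sketch of such a launch (the specific override keys shown are illustrative assumptions, not an exhaustive or required set):

```shell
# Launch training, overriding selected config values on the command line.
# Keys mirror the YAML hierarchy, e.g. training.run.* from
# conf/training/mt5/390m.yaml.
python3 main.py \
    stages=[training] \
    training=mt5/390m \
    training.run.time_limit="1-00:00:00"
```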