Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Training
Specify the configuration used for the training pipeline by setting the training key in conf/config.yaml to the desired training configuration file. training must also be included in stages to run the training pipeline. The training configuration must be set to t5/<model_size>; for example, t5/220m refers to the configuration file conf/training/t5/220m.yaml.
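For reference, here is a minimal sketch of the relevant entries in conf/config.yaml, assuming the Hydra-style defaults list used by the launcher (the t5/220m selection is only an example):

defaults:
  - training: t5/220m  # selects conf/training/t5/220m.yaml

stages:
  - training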
To adjust the hyperparameters of a training run, edit its configuration file. All supported model types and sizes are stored in the directory conf/training.
Slurm
Set the configuration for your Slurm cluster in conf/cluster/bcm.yaml:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
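On a typical Slurm cluster you only need to fill in the scheduler-specific fields for your allocation; for example (the partition and account names below are placeholders):

partition: batch
account: my_account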
Set the job-specific training configurations in the run section of conf/training/<model_type>/<model_size>.yaml:
run:
  name: t5_220m
  results_dir: ${base_results_dir}/${.name}
  time_limit: "7-00:00:00"
  dependency: "singleton"
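For example, for a short trial run you might rename the run and shorten the Slurm time limit (the values below are illustrative, not recommendations):

run:
  name: t5_220m_test
  results_dir: ${base_results_dir}/${.name}
  time_limit: "0-02:00:00"
  dependency: "singleton"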
To run only the training pipeline, and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
  - training
Then enter:
python3 main.py
Base Command Platform
Select the cluster-related configuration according to the NGC documentation. Then enter python3 main.py to launch the job, overriding any training configuration values that need to be changed.
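As a rough sketch, the cluster-related entries in conf/config.yaml for Base Command Platform might look like the following; the key names and paths shown are assumptions that depend on your launcher version and workspace layout, so follow the NGC documentation for the exact values:

cluster_type: bcp  # assumed key: bcm for Slurm, bcp for Base Command Platform
launcher_scripts_path: /opt/NeMo-Framework-Launcher/launcher_scripts  # assumed path inside the container
data_dir: /mount/workspace/data           # assumed workspace mount for datasets
base_results_dir: /mount/workspace/results  # assumed workspace mount for results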