Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Training
You specify the configuration for the training pipeline in conf/config.yaml, setting the training configuration to the path of the file to be used for training. You must include training in stages to run the training pipeline.
Set the training configuration to mt5/<model_size>. For example, to train a 390M model you would set it to mt5/390m, which specifies the file conf/training/mt5/390m.yaml.
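For example, the relevant entries in conf/config.yaml would look like the following. This is a minimal sketch assuming the launcher's usual layout, in which the training configuration is selected through the Hydra defaults list and stages is a top-level list; other entries are omitted:
defaults:
  - training: mt5/390m   # selects conf/training/mt5/390m.yaml
stages:
  - training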
You can change the configuration to adjust the hyperparameters of the training runs. All supported model types and sizes are stored in the directory conf/training.
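For example, a model-size file such as conf/training/mt5/390m.yaml contains trainer and model sections whose values you can edit before launching. The keys and values below are illustrative only; check the actual file for the supported settings:
trainer:
  num_nodes: 4          # number of nodes to train on (illustrative value)
  max_steps: 1000000    # total number of training steps (illustrative value)
model:
  global_batch_size: 2048   # illustrative value
  micro_batch_size: 4       # illustrative value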
Slurm
Set the configuration for your Slurm cluster in conf/cluster/bcm.yaml:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
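For example, on a cluster with a Slurm partition named batch and an account named my_account (both hypothetical names), you would only need to change the first two entries:
partition: batch      # hypothetical partition name
account: my_account   # hypothetical Slurm account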
Set the training job-specific configurations in the run section of conf/training/<model_type>/<model_size>.yaml:
run:
  name: mt5_390m
  results_dir: ${base_results_dir}/${.name}
  time_limit: "7-00:00:00"
  dependency: "singleton"
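You can also override these run values at launch time with Hydra's command-line syntax instead of editing the file. The override paths below assume the training.run.* naming used by the launcher and should be verified against your version:
python3 main.py training=mt5/390m training.run.name=mt5_390m_test training.run.time_limit="2-00:00:00"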
To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
  - training
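To run additional pipelines in the same launch, list them in order. For example, to run data preparation, training, and evaluation together (the stage names shown here mirror the pipelines mentioned above and should be checked against your launcher version):
stages:
  - data_preparation
  - training
  - evaluation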
Then enter:
python3 main.py
Base Command Platform
Select the cluster-related configuration following the NGC documentation, then enter python3 main.py to launch the job, overriding the values of any training configurations that need to be updated.
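For example, a Base Command Platform launch that overrides a few values from the command line might look like the following; the cluster_type setting and the specific override keys are assumptions to verify against your launcher version:
python3 main.py cluster_type=bcp stages=[training] training=mt5/390m \
  training.trainer.num_nodes=2 training.model.global_batch_size=2048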