Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Model Training
You must set the configuration to be used for the training pipeline in conf/config.yaml, setting the training configuration to specify the file to be used for training purposes. You must include training in stages to run the training pipeline.
Set the training configuration to bert/<model_size> for BERT models. For example, to train a 110M BERT model, use bert/110m, which specifies the training file conf/training/bert/110m.yaml. Update the configuration to adjust the hyperparameters of the training runs. All supported model types and sizes are stored in the directory conf/training.
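The naming convention above can be sketched as a small helper that maps a training value to its file under conf/training. This is only an illustration of the convention described here: the function name training_config_path and the conf_root parameter are hypothetical and not part of the launcher.

```python
from pathlib import PurePosixPath


def training_config_path(training: str, conf_root: str = "conf") -> PurePosixPath:
    """Map a training value such as 'bert/110m' to its YAML file.

    Hypothetical helper: it only mirrors the naming convention
    conf/training/<model_type>/<model_size>.yaml described in the text.
    """
    model_type, _, model_size = training.partition("/")
    if not model_type or not model_size:
        raise ValueError(f"expected '<model_type>/<model_size>', got {training!r}")
    return PurePosixPath(conf_root) / "training" / model_type / f"{model_size}.yaml"


print(training_config_path("bert/110m"))  # conf/training/bert/110m.yaml
```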
Slurm
Set the configuration for your Slurm cluster in conf/cluster/bcm.yaml:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
Set the job-specific training configurations in the run section of conf/training/<model_type>/<model_size>.yaml:
run:
  name: bert_110m
  results_dir: ${base_results_dir}/${.name}
  time_limit: "7-00:00:00"
  dependency: "singleton"
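The time_limit value follows Slurm's days-hours:minutes:seconds form, so "7-00:00:00" requests seven days. A minimal sketch of reading such a value is shown below. The helper parse_time_limit is hypothetical and handles only this one form, not the other time formats Slurm accepts.

```python
from datetime import timedelta


def parse_time_limit(value: str) -> timedelta:
    """Parse a 'D-HH:MM:SS' time limit into a timedelta.

    Hypothetical helper for illustration; it does not cover the shorter
    forms (for example 'MM:SS') that Slurm also accepts.
    """
    days, sep, clock = value.partition("-")
    if not sep:
        raise ValueError(f"expected 'D-HH:MM:SS', got {value!r}")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days), hours=hours, minutes=minutes, seconds=seconds)


print(parse_time_limit("7-00:00:00"))  # 7 days, 0:00:00
```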
To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
  - training
Then enter:
python3 main.py
Base Command Platform
Select the cluster-related configuration following the NGC documentation. Then run python3 main.py to launch the job, overriding the training job values of any configurations that need to be updated.
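Overriding values at launch can be sketched as assembling key=value arguments for main.py. This assumes the launcher accepts Hydra-style command-line overrides, which this page does not state. The helper build_launch_command and the specific keys shown (stages, training, training.run.time_limit) are illustrative assumptions, so check them against your configuration files before use.

```python
import shlex


def build_launch_command(overrides: dict[str, str], script: str = "main.py") -> list[str]:
    """Build an argument list of the form: python3 main.py key=value ...

    Hypothetical helper; it assumes Hydra-style key=value overrides.
    """
    cmd = ["python3", script]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


cmd = build_launch_command(
    {
        "stages": "[training]",                   # assumed list-override syntax
        "training": "bert/110m",                  # selects conf/training/bert/110m.yaml
        "training.run.time_limit": "1-00:00:00",  # assumed key path into the run section
    }
)

# Print a shell-quoted form that can be pasted into a terminal.
print(shlex.join(cmd))
```

Building the command as a list keeps each override a single argument, which avoids shell-quoting problems if you pass it to subprocess.run.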