Model Training

Set the configuration for the training pipeline in conf/config.yaml, using the training field to specify the configuration file to be used for training. You must also include training in stages to run the training pipeline.

Set the training configuration to bert/<model_size> for BERT models. For example, to train a 110M BERT model, use bert/110m, which selects the training file conf/training/bert/110m.yaml. Edit that file to adjust the hyperparameters of the training run. All supported model types and sizes are stored in the directory conf/training.
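
Putting these together, the relevant entries in conf/config.yaml for a 110M BERT training run would look like the following (a minimal sketch assuming the Hydra defaults-list layout of the launcher; the file contains other fields, omitted here):

defaults:
  - training: bert/110m

stages:
  - training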

Set the configuration for your Slurm cluster in conf/cluster/bcm.yaml:

partition: null
account: null
exclusive: True                     # request whole nodes
gpus_per_task: null
gpus_per_node: 8
mem: 0                              # 0 requests all memory on each node
overcommit: False
job_name_prefix: "nemo-megatron-"

Set the job-specific training configurations in the run section of conf/training/<model_type>/<model_size>.yaml:

run:
  name: bert_110m
  results_dir: ${base_results_dir}/${.name}
  time_limit: "7-00:00:00"          # Slurm days-hours:minutes:seconds format
  dependency: "singleton"           # at most one job with this name runs at a time

To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - training

Then enter:

python3 main.py

Select the cluster-related configuration following the NGC documentation, then run python3 main.py to launch the job. Any training configuration values that need to be updated can be overridden at launch time.
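
Because the launcher is built on Hydra, individual values can also be overridden on the command line without editing the YAML files. For example (hypothetical overrides; the exact keys depend on your chosen training config):

python3 main.py training=bert/110m training.run.time_limit="1-00:00:00"

Here training=bert/110m selects the training configuration, and training.run.time_limit overrides the run section shown above.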
