Model Training - NVIDIA Docs

Define the configuration used for the training pipeline by setting the training configuration in conf/config.yaml. Setting the configuration to gpt3/5b specifies the configuration file as conf/training/gpt3/5b.yaml. Modify the configuration to adjust the hyperparameters of the training runs. All supported model types and sizes are stored in the directory conf/training.
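
The naming convention described above can be sketched in Python. This is a minimal illustration of how a training value maps to a file path, not the launcher's own code; the helper name training_config_path is hypothetical.

```python
from pathlib import PurePosixPath

# Root of the training configurations, as named in the text above.
TRAINING_CONF_DIR = PurePosixPath("conf/training")


def training_config_path(training: str) -> PurePosixPath:
    """Return the YAML file that a training value such as 'gpt3/5b' selects.

    Hypothetical helper: it only mirrors the convention
    <model_type>/<model_size> -> conf/training/<model_type>/<model_size>.yaml.
    """
    model_type, model_size = training.split("/")
    return TRAINING_CONF_DIR / model_type / f"{model_size}.yaml"


print(training_config_path("gpt3/5b"))  # conf/training/gpt3/5b.yaml
```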

The value training must be included in stages to run the training pipeline.

Slurm

Define the configuration for a Slurm cluster in conf/cluster/bcm.yaml:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
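
Most of these settings have the same names as standard sbatch options. The Python sketch below shows one plausible translation. It is an assumption about how such a file could be consumed, not the launcher's actual implementation, and the function name sbatch_flags is hypothetical. Settings left as null are omitted so that Slurm's own defaults apply. job_name_prefix is not translated because the text does not describe how the full job name is built.

```python
def sbatch_flags(cluster: dict) -> list[str]:
    """Translate cluster settings into sbatch-style flags (illustrative only)."""
    flags = []
    # Settings that carry a value; null (None) entries are skipped.
    valued = {
        "partition": "--partition",
        "account": "--account",
        "gpus_per_task": "--gpus-per-task",
        "gpus_per_node": "--gpus-per-node",
        "mem": "--mem",
    }
    for key, flag in valued.items():
        value = cluster.get(key)
        if value is not None:
            flags.append(f"{flag}={value}")
    # Boolean settings become bare switches when enabled.
    switches = {"exclusive": "--exclusive", "overcommit": "--overcommit"}
    for key, flag in switches.items():
        if cluster.get(key):
            flags.append(flag)
    return flags


# The default values from conf/cluster/bcm.yaml shown above.
cluster = {
    "partition": None,
    "account": None,
    "exclusive": True,
    "gpus_per_task": None,
    "gpus_per_node": 8,
    "mem": 0,
    "overcommit": False,
    "job_name_prefix": "nemo-megatron-",
}
print(sbatch_flags(cluster))  # ['--gpus-per-node=8', '--mem=0', '--exclusive']
```

In Slurm, a --mem value of 0 requests all of the memory on each node, which is why mem: 0 is passed through rather than dropped.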

Set the job-specific training configurations in the run section of conf/training/<model_type>/<model_size>.yaml:

run:
    name: gpt3_5b
    results_dir: ${base_results_dir}/${.name}
    time_limit: "1-12:00:00"
    dependency: "singleton"
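
The time_limit value uses Slurm's days-hours:minutes:seconds format, so "1-12:00:00" means one day and twelve hours. The Python sketch below converts such a string into a duration for a quick sanity check. The function name parse_time_limit is hypothetical, and only the "D-HH:MM:SS" and "HH:MM:SS" forms are handled.

```python
from datetime import timedelta


def parse_time_limit(value: str) -> timedelta:
    """Parse a 'days-hours:minutes:seconds' string such as '1-12:00:00'."""
    # Everything before the last '-' is the day count; it may be absent.
    days, _, clock = value.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days or 0), hours=hours, minutes=minutes, seconds=seconds)


limit = parse_time_limit("1-12:00:00")
print(limit)                         # 1 day, 12:00:00
print(limit.total_seconds() / 3600)  # 36.0
```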

To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - training

Then enter:

python3 main.py

Base Command Platform

Select the cluster-related configuration according to the NGC documentation. Then enter python3 main.py to launch the job, overriding the values of any training configurations you need to change.
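
As a rough sketch of what overriding values at launch could look like, the Python snippet below assembles a command line with key=value arguments. It assumes main.py accepts Hydra-style key=value overrides, which this page does not state. The override keys are hypothetical, chosen only to mirror the settings discussed earlier, and the helper build_launch_command is not part of the launcher.

```python
import shlex


def build_launch_command(overrides: dict[str, str]) -> list[str]:
    """Build an argv list of the form: python3 main.py key=value ...

    Assumption: main.py takes key=value overrides on the command line.
    """
    return ["python3", "main.py"] + [f"{key}={value}" for key, value in overrides.items()]


# Hypothetical override keys, for illustration only.
command = build_launch_command({
    "training": "gpt3/5b",
    "stages": "[training]",
})

# shlex.join quotes arguments that a shell would otherwise interpret,
# such as the square brackets in the stages list.
print(shlex.join(command))
```

Building the command as a list and quoting it only for display avoids shell-quoting mistakes with list-valued overrides.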

Kubernetes

Define the configuration for a Kubernetes cluster in conf/cluster/k8s.yaml:

pull_secret: null  # Kubernetes secret for the container registry to pull private containers.
shm_size: 512Gi  # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
nfs_server: null  # Hostname or IP address for the NFS server where data is stored.
nfs_path: null  # Path to store data in the NFS server.
ib_resource_name: "nvidia.com/hostdev"  # Specify the resource name for IB devices according to Kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters.
ib_count: "8"  # Specify the number of IB devices to include per node in each pod.
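
A few of the constraints noted in the comments above can be checked before launching. The Python sketch below is illustrative only: the function name check_k8s_cluster_config is hypothetical, and treating nfs_server and nfs_path as required is an assumption based on their descriptions, not something this page states.

```python
def check_k8s_cluster_config(cfg: dict) -> list[str]:
    """Return a list of problems found in a k8s cluster config (illustrative checks)."""
    problems = []
    # The comment on shm_size says the value should end in "Gi".
    if not str(cfg.get("shm_size", "")).endswith("Gi"):
        problems.append('shm_size should end in "Gi"')
    # Assumption: the NFS settings must be filled in before a job can read data.
    for key in ("nfs_server", "nfs_path"):
        if cfg.get(key) is None:
            problems.append(f"{key} is not set")
    # ib_count is a quoted number in the YAML, so accept digit strings.
    if not str(cfg.get("ib_count", "")).isdigit():
        problems.append("ib_count should be a whole number")
    return problems


# The default values from conf/cluster/k8s.yaml shown above.
defaults = {
    "pull_secret": None,
    "shm_size": "512Gi",
    "nfs_server": None,
    "nfs_path": None,
    "ib_resource_name": "nvidia.com/hostdev",
    "ib_count": "8",
}
print(check_k8s_cluster_config(defaults))  # ['nfs_server is not set', 'nfs_path is not set']
```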

Set the job-specific training parameters in the run section of conf/training/<model_type>/<model_size>.yaml:

run:
    name: gpt3_5b
    results_dir: ${base_results_dir}/${.name}
    time_limit: "1-12:00:00"
    dependency: "singleton"

Set the cluster and cluster_type settings to k8s in conf/config.yaml.

To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - training

Then enter:

python3 main.py

This launches a Helm chart based on the training configurations. The chart spawns one pod for each node, including any networking fabrics specified in the cluster settings, for distributed training.