Model Training

Define the configuration used for the training pipeline by setting the training configuration in conf/config.yaml. Setting the configuration to gpt3/5b specifies the configuration file as conf/training/gpt3/5b.yaml. Modify the configuration to adjust the hyperparameters of the training runs. All supported model types and sizes are stored in the directory conf/training.

The value training must be included in stages to run the training pipeline.
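
The relationship between the training value, the configuration file it selects, and the stages list can be sketched as follows. This is a minimal illustration and not the launcher's own code: the helper names training_config_path and training_enabled are hypothetical, and the configuration is represented as a plain dictionary rather than being loaded from conf/config.yaml.

```python
from pathlib import PurePosixPath


def training_config_path(training: str, conf_dir: str = "conf") -> str:
    """Map a training value such as "gpt3/5b" to its YAML file path."""
    return str(PurePosixPath(conf_dir) / "training" / f"{training}.yaml")


def training_enabled(config: dict) -> bool:
    """Return True when the stages list includes the training stage."""
    return "training" in config.get("stages", [])


# Hypothetical in-memory stand-in for the relevant parts of conf/config.yaml.
config = {"training": "gpt3/5b", "stages": ["training"]}

print(training_config_path(config["training"]))  # conf/training/gpt3/5b.yaml
print(training_enabled(config))                  # True
```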

Slurm

Define the configuration for a Slurm cluster in conf/cluster/bcm.yaml:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
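
To make the cluster settings above more concrete, the sketch below shows how they could translate into standard sbatch options. The mapping is an assumption made for illustration: the helper sbatch_flags is hypothetical, the way the job name is built from job_name_prefix is a guess, and the script that the launcher actually generates may differ. Null values are treated as "leave the option unset". In Slurm, a mem value of 0 requests all of the memory on each node.

```python
def sbatch_flags(cluster: dict, job_name: str) -> list[str]:
    """Translate cluster settings into sbatch-style flags, skipping null values."""
    # Assumption: the job name is the prefix followed by the run name.
    flags = [f"--job-name={cluster['job_name_prefix']}{job_name}"]
    # Options that carry a value; None (YAML null) means "do not set".
    for key in ("partition", "account", "gpus_per_task", "gpus_per_node", "mem"):
        value = cluster.get(key)
        if value is not None:
            flags.append(f"--{key.replace('_', '-')}={value}")
    # Boolean switches are emitted only when enabled.
    for key in ("exclusive", "overcommit"):
        if cluster.get(key):
            flags.append(f"--{key}")
    return flags


# The default values shown in conf/cluster/bcm.yaml, as a plain dictionary.
cluster = {
    "partition": None, "account": None, "exclusive": True,
    "gpus_per_task": None, "gpus_per_node": 8, "mem": 0,
    "overcommit": False, "job_name_prefix": "nemo-megatron-",
}
print(sbatch_flags(cluster, "gpt3_5b"))
```

With the defaults, only the job name, gpus_per_node, mem, and exclusive produce flags. Set partition and account to the values for your cluster to have them included.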

Set the job-specific training configurations in the run section of conf/training/<model_type>/<model_size>.yaml:

run:
    name: gpt3_5b
    results_dir: ${base_results_dir}/${.name}
    time_limit: "1-12:00:00"
    dependency: "singleton"
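
The time_limit value uses the Slurm days-hours:minutes:seconds form, so "1-12:00:00" means one day and twelve hours. The singleton dependency is a Slurm feature that makes a job wait until earlier jobs with the same name and user have finished. The following sketch converts a time limit to hours. The helper parse_time_limit is hypothetical and handles only the D-HH:MM:SS and HH:MM:SS forms, not every format that Slurm accepts.

```python
import re
from datetime import timedelta


def parse_time_limit(value: str) -> timedelta:
    """Parse "D-HH:MM:SS" or "HH:MM:SS" into a timedelta."""
    match = re.fullmatch(r"(?:(\d+)-)?(\d+):(\d{2}):(\d{2})", value)
    if match is None:
        raise ValueError(f"unsupported time_limit format: {value!r}")
    # The days group is None when the "D-" prefix is absent.
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


limit = parse_time_limit("1-12:00:00")
print(limit.total_seconds() / 3600)  # 36.0
```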

To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - training

Then enter:

python3 main.py

Base Command Platform

Select the cluster-related configuration according to the NGC documentation. Then enter python3 main.py to launch the job, overriding the values of any training job configurations you need to change.
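
One way to pass overrides is on the command line. The sketch below only builds the command string and does not run it. It assumes that main.py accepts key=value overrides in the Hydra style, which the text above does not state explicitly. The helper launch_command is hypothetical, and the key training.run.time_limit is an assumed path to the time_limit value in the run section.

```python
import shlex


def launch_command(overrides: dict) -> str:
    """Build a main.py command line from key=value overrides."""
    args = ["python3", "main.py"]
    args += [f"{key}={value}" for key, value in overrides.items()]
    # shlex.join quotes arguments that contain shell-special characters,
    # such as the square brackets in a list override.
    return shlex.join(args)


print(launch_command({
    "training": "gpt3/5b",
    "stages": "[training]",
    "training.run.time_limit": "0-04:00:00",
}))
```

Building the command from a dictionary keeps each override in one place and leaves the quoting to shlex, which matters for list values such as stages.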

Kubernetes

Define the configuration for a Kubernetes cluster in conf/cluster/k8s.yaml:

pull_secret: null  # Kubernetes secret for the container registry to pull private containers.
shm_size: 512Gi  # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
nfs_server: null  # Hostname or IP address for the NFS server where data is stored.
nfs_path: null  # Path to store data in the NFS server.
ib_resource_name: "nvidia.com/hostdev"  # Specify the resource name for IB devices according to Kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters.
ib_count: "8"  # Specify the number of IB devices to include per node in each pod.
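
Before launching, it can help to check the Kubernetes settings against the notes in the comments above. The following is a minimal sketch with a hypothetical helper, check_k8s_cluster, operating on a plain dictionary. The rule that nfs_server and nfs_path must be filled in is an assumption based on their descriptions and is not a documented requirement.

```python
import re


def check_k8s_cluster(cfg: dict) -> list[str]:
    """Return a list of problems found in a conf/cluster/k8s.yaml-style dict."""
    problems = []
    # shm_size should be a number followed by "Gi".
    if not re.fullmatch(r"\d+Gi", str(cfg.get("shm_size", ""))):
        problems.append('shm_size should end in "Gi", for example 512Gi')
    # Assumption: the NFS location must be filled in before launching.
    for key in ("nfs_server", "nfs_path"):
        if cfg.get(key) is None:
            problems.append(f"{key} is not set")
    # ib_count is stored as a string, so confirm it holds a whole number.
    if not str(cfg.get("ib_count", "")).isdigit():
        problems.append("ib_count should be a whole number")
    return problems


# The default values shown in conf/cluster/k8s.yaml, as a plain dictionary.
cfg = {
    "pull_secret": None, "shm_size": "512Gi",
    "nfs_server": None, "nfs_path": None,
    "ib_resource_name": "nvidia.com/hostdev", "ib_count": "8",
}
print(check_k8s_cluster(cfg))  # ['nfs_server is not set', 'nfs_path is not set']
```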

Set the job-specific training parameters in the run section of conf/training/<model_type>/<model_size>.yaml:

run:
    name: gpt3_5b
    results_dir: ${base_results_dir}/${.name}
    time_limit: "1-12:00:00"
    dependency: "singleton"

Set the cluster and cluster_type settings to k8s in conf/config.yaml.

To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - training

Then enter:

python3 main.py

This launches a Helm chart based on the training configurations. The chart spawns one pod for each node, including any networking fabrics specified in the cluster settings, for distributed training.