NVIDIA provides an easy-to-use yet powerful pipeline to perform distributed training of GPT, T5, and mT5 models across multiple nodes and GPUs, along with well-established recipes for different model sizes in which throughput has been maximized and the convergence properties of the models have been tested and confirmed.
Define the configuration used for the training pipeline by setting the training configuration in conf/config.yaml. Setting the configuration to gpt3/5b specifies the configuration file conf/training/gpt3/5b.yaml. Modify the configuration to adjust the hyperparameters of the training runs. All supported model types and sizes are stored in the directory conf/training. The value training must be included in stages to run the training pipeline.
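For reference, the relevant part of conf/config.yaml might look like the following minimal sketch (it assumes the launcher's Hydra-style defaults list; other keys in the file are omitted here):

defaults:
  - training: gpt3/5b  # selects conf/training/gpt3/5b.yaml

stages:
  - training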
Slurm
Define the configuration for a Slurm cluster in conf/cluster/bcm.yaml:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
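For example, on a hypothetical cluster the filled-in file could look like this (the partition and account values below are placeholders; substitute the names your site uses):

partition: batch        # placeholder: your Slurm partition
account: myproject      # placeholder: your Slurm account
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"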
Set the job-specific training configurations in the run section of conf/training/<model_type>/<model_size>.yaml:
run:
  name: gpt3_5b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "1-12:00:00"
  dependency: "singleton"
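In this block, ${.name} is an OmegaConf relative interpolation that resolves to the name key in the same section, time_limit uses Slurm's D-HH:MM:SS format (here, 1 day and 12 hours), and dependency: "singleton" maps to Slurm's singleton dependency, which allows only one job with this name and user to run at a time. After interpolation, results_dir resolves as in this illustrative expansion:

run:
  name: gpt3_5b
  results_dir: ${base_results_dir}/gpt3_5b  # ${.name} resolved; ${base_results_dir} is set at the top level of the configuration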
To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- training
Then enter:
python3 main.py
Base Command Platform
Select the cluster-related configuration according to the NGC documentation. Then enter python3 main.py to launch the job and override the values of any training configurations you need to change.
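Overrides use standard Hydra command-line syntax. The following is only a sketch: the training.trainer.* keys are assumptions about the layout of the training file, so check your configuration for the exact names:

python3 main.py training=gpt3/5b stages=[training] \
  training.trainer.num_nodes=4 \
  training.trainer.max_steps=1000  # assumed keys; adjust to your config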
Kubernetes
Define the configuration for a Kubernetes cluster in conf/cluster/k8s.yaml:
pull_secret: null # Kubernetes secret for the container registry to pull private containers.
shm_size: 512Gi # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
nfs_server: null # Hostname or IP address for the NFS server where data is stored.
nfs_path: null # Path to store data in the NFS server.
ib_resource_name: "nvidia.com/hostdev" # Specify the resource name for IB devices according to Kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters.
ib_count: "8" # Specify the number of IB devices to include per node in each pod.
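For example, a filled-in file for a cluster that stores data on an NFS share might look like this (the server hostname and path are placeholders; substitute your own values):

pull_secret: null
shm_size: 512Gi
nfs_server: nfs.example.com   # placeholder hostname
nfs_path: /export/nemo-data   # placeholder path
ib_resource_name: "nvidia.com/hostdev"
ib_count: "8"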
Set the job-specific training parameters in the run section of conf/training/<model_type>/<model_size>.yaml:
run:
  name: gpt3_5b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "1-12:00:00"
  dependency: "singleton"
Set the cluster and cluster_type settings to k8s in conf/config.yaml.
To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- training
Then enter:
python3 main.py
This launches a Helm chart based on the training configuration, which spawns one pod for each node, including any networking fabrics specified in the cluster settings, for distributed training.
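You can watch the job come up with standard Helm and kubectl commands (the release and pod names depend on your run configuration):

helm list                    # list the launched training release
kubectl get pods             # expect one pod per node in Running state
kubectl logs -f <pod-name>   # follow training output; <pod-name> comes from the previous command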
You specify the configuration to be used for the training pipeline by setting the training configuration in conf/config.yaml to the pathname of the file to be used for training. training must be included in stages to run the training pipeline. The training configuration must be set to t5/<model_size>. For example, you could set t5/220m, which refers to the configuration file conf/training/t5/220m.yaml.
Change the configurations to adjust the hyperparameters of the training runs. All supported model types and sizes are stored in the directory conf/training.
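Because the launcher is driven by Hydra, the same selection can also be made from the command line instead of editing conf/config.yaml (a sketch using Hydra's config-group override syntax):

python3 main.py training=t5/220m stages=[training]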
Slurm
Set the configuration for your Slurm cluster in conf/cluster/bcm.yaml:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
Set the job-specific training configurations in the run section of conf/training/<model_type>/<model_size>.yaml:
run:
  name: t5_220m
  results_dir: ${base_results_dir}/${.name}
  time_limit: "7-00:00:00"
  dependency: "singleton"
To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- training
Then enter:
python3 main.py
Base Command Platform
Select the cluster-related configuration according to the NGC documentation. Then enter python3 main.py to launch the job and override the values of any training configurations you need to change.
You specify the configuration for the training pipeline in conf/config.yaml, setting the training configuration to the pathname of the file to be used for training. You must include training in stages to run the training pipeline.
Set the training configuration to mt5/<model_size>. For example, to train a 390M model you would set it to mt5/390m, specifying the file conf/training/mt5/390m.yaml.
You can change the configuration to adjust the hyperparameters of the training runs. All supported model types and sizes are stored in the directory conf/training.
Slurm
Set the configuration for your Slurm cluster in conf/cluster/bcm.yaml:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
Set the training job-specific configurations in the run section of conf/training/<model_type>/<model_size>.yaml:
run:
  name: mt5_390m
  results_dir: ${base_results_dir}/${.name}
  time_limit: "7-00:00:00"
  dependency: "singleton"
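Here dependency: "singleton" corresponds to Slurm's singleton job dependency: the job starts only after any earlier job with the same name and user has finished, which keeps successive mt5_390m runs from overlapping. In a hand-written batch script the equivalent directive would be:

#SBATCH --dependency=singleton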
To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- training
Then enter:
python3 main.py
Base Command Platform
Select the cluster-related configuration according to the NGC documentation. Then enter python3 main.py to launch the job and override the values of any training configurations that need to be updated.
Set the configuration to be used for the training pipeline in conf/config.yaml by pointing the training configuration at the file to be used for training. You must include training in stages to run the training pipeline.
Set the training configuration to bert/<model_size> for BERT models. For example, to train a 110M BERT model you would use bert/110m, which specifies the training file conf/training/bert/110m.yaml. Update the configuration to adjust the hyperparameters of the training runs. All supported model types and sizes are stored in the directory conf/training.
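To see which BERT sizes ship with your copy of the launcher, you can list the configuration directory (the exact set of files depends on your release):

ls conf/training/bert/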
Slurm
Set the configuration for your Slurm cluster in conf/cluster/bcm.yaml:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
Set the job-specific training configurations in the run section of conf/training/<model_type>/<model_size>.yaml:
run:
  name: bert_110m
  results_dir: ${base_results_dir}/${.name}
  time_limit: "7-00:00:00"
  dependency: "singleton"
To run only the training pipeline and not the data preparation, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- training
Then enter:
python3 main.py
Base Command Platform
Select the cluster-related configuration following the NGC documentation. Then run python3 main.py to launch the job and override the values of any training configurations that need to be updated.