NVIDIA provides configurations for nine GPT model sizes: 126M, 400M_improved, 1B_improved, 5B, 7B_improved, 20B, 40B, 40B_improved, and 175B. These configurations include carefully selected hyperparameters, which you may use as guidelines for any custom model configurations.
The configurations are defined by configuration files in the directory conf/training/gpt3. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.
For Base Command Platform, you must launch all jobs in multi-node mode.
For Kubernetes clusters, the cluster and cluster_type settings in conf/config.yaml must be set to k8s.
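For example, a Kubernetes launch needs only these two values in conf/config.yaml (a minimal sketch showing just the two settings named above; the rest of the file is unchanged):

cluster: k8s
cluster_type: k8s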
126M configuration
The 126M model uses the bf16 data type. It can be trained in about 20 hours using 8 nodes with 8 GPUs per node. The model includes 12 transformer layers, a hidden size of 768, and 12 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file gpt3/126m.yaml for parameter details.
To train a 126M model on a Slurm or Kubernetes cluster, modify the conf/config.yaml file to set:
- training: gpt3/126m
stages:
- training
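Note that these two settings live in different parts of conf/config.yaml: the training entry is an item in the Hydra defaults list at the top of the file, while stages is a top-level list. A sketch of both locations, with all other entries omitted:

defaults:
  - training: gpt3/126m

stages:
  - training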
Then enter:
python3 main.py
To train a 126M GPT model on a Base Command Platform cluster with 8 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/126m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
To train with a different number of nodes, you can adjust the relevant parameters either in the YAML configuration file or from the command line. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, all jobs must be launched in multi-node mode.
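For example, to launch the same 126M run on 4 nodes with a smaller per-GPU batch, both values can be overridden on the command line. The micro batch size below is illustrative, not a tuned value:

python3 main.py training=gpt3/126m stages=[training] \
    training.trainer.num_nodes=4 \
    training.model.micro_batch_size=2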
400M_improved configuration
The 400M_improved model uses the bf16 data type. It can be trained in about 2 days using 8 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 1024, and 16 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file gpt3/400m_improved.yaml for parameter details.
To train a 400M_improved model on a Slurm or Kubernetes cluster, modify the conf/config.yaml file to set:
- training: gpt3/400m_improved
stages:
- training
Then enter:
python3 main.py
To train a 400M_improved GPT model on a Base Command Platform cluster with 8 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/400m_improved \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
1B_improved configuration
The 1B_improved model uses the bf16 data type. It can be trained in about 3 days using 8 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, and 16 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file gpt3/1b_improved.yaml for parameter details.
To train a 1B_improved model on a Slurm or Kubernetes cluster, modify the conf/config.yaml file to set:
- training: gpt3/1b_improved
stages:
- training
Then enter:
python3 main.py
To train a 1B_improved GPT model on a Base Command Platform cluster with 8 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/1b_improved \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
5B configuration
The 5B model uses the bf16 data type. It can be trained in about 5 days using 16 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 4096, and 32 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 1. For details on the parameters, see the configuration file gpt3/5b.yaml.
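With tensor and pipeline parallelism both at 1, every GPU holds a complete model replica, so the data-parallel size equals the total GPU count: data_parallel_size = (num_nodes × gpus_per_node) / (tensor_parallel × pipeline_parallel) = (16 × 8) / (1 × 1) = 128.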
To train a 5B GPT model, modify the conf/config.yaml file to set:
- training: gpt3/5b
stages:
- training
Then enter:
python3 main.py
To train a 5B GPT model on a Base Command Platform cluster with 16 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/5b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
7B_improved configuration
The 7B_improved model uses the bf16 data type. It can be trained in about 6 days using 16 nodes with 8 GPUs per node. The model includes 32 transformer layers, a hidden size of 4096, and 32 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 2. For details on the parameters, see the configuration file gpt3/7b_improved.yaml.
To train a 7B_improved GPT model, modify the conf/config.yaml file to set:
- training: gpt3/7b_improved
stages:
- training
Then enter:
python3 main.py
To train a 7B_improved GPT model on a Base Command Platform cluster with 16 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/7b_improved \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
20B configuration
The 20B model uses the bf16 data type. It can be trained in about 6 days using 64 nodes with 8 GPUs per node. The model includes 44 transformer layers, a hidden size of 6144, and 48 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 1. For details on the parameters, see the configuration file gpt3/20b.yaml.
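In the configuration file, this parallelism layout corresponds to the following model settings (a sketch of the relevant keys as used in NeMo training configs):

tensor_model_parallel_size: 4
pipeline_model_parallel_size: 1

The remaining factor, (64 × 8) / (4 × 1) = 128, is the data-parallel size.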
To train a 20B GPT model, modify the conf/config.yaml file to set:
- training: gpt3/20b
stages:
- training
Then enter:
python3 main.py
To train a 20B GPT model on a Base Command Platform cluster with 64 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/20b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
40B configuration
The 40B model uses the bf16 data type. It can be trained in about 6 days using 128 nodes with 8 GPUs per node. The model includes 48 transformer layers, a hidden size of 8192, and 64 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 1. For details on the parameters, see the configuration file gpt3/40b.yaml.
To train a 40B GPT model, modify the conf/config.yaml file to set:
- training: gpt3/40b
stages:
- training
Then enter:
python3 main.py
To train a 40B GPT model on a Base Command Platform cluster with 128 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/40b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
40B_improved configuration
The 40B_improved model uses the bf16 data type. It can be trained in about 6 days using 96 nodes with 8 GPUs per node. The model includes 48 transformer layers, a hidden size of 8192, and 64 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 2. For details on the parameters, see the configuration file gpt3/40b_improved.yaml.
To train a 40B_improved GPT model, modify the conf/config.yaml file to set:
- training: gpt3/40b_improved
stages:
- training
Then enter:
python3 main.py
To train a 40B_improved GPT model on a Base Command Platform cluster with 96 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/40b_improved \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
175B configuration
The 175B model uses the bf16 data type. It can be trained in about 24 days using 128 nodes with 8 GPUs per node. The model includes 96 transformer layers, a hidden size of 12288, and 96 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 16, with interleaved pipeline scheduling and a virtual pipeline chunk size of 6. For details on the parameters, see the configuration file gpt3/175b.yaml.
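The interleaved schedule is expressed through the virtual pipeline setting alongside the tensor and pipeline sizes (a sketch of the relevant model keys matching the layout described above):

tensor_model_parallel_size: 8
pipeline_model_parallel_size: 16
virtual_pipeline_model_parallel_size: 6  # 6 virtual chunks per pipeline stage (interleaved scheduling)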
To train a 175B GPT model, modify the conf/config.yaml file to set:
- training: gpt3/175b
stages:
- training
Then enter:
python3 main.py
To train a 175B GPT model on a Base Command Platform cluster with 128 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/175b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
FP8 with Transformer Engine
NVIDIA Transformer Engine (TE) is an open-source library for accelerating Transformer-based models on NVIDIA Hopper GPUs. It enables 8-bit floating point (FP8) precision, providing better performance with lower memory utilization in both training and inference. TE is available on GitHub.
You can now use the fp8 data type in the NeMo framework to pre-train GPT models. For example, to turn on fp8 when pre-training a GPT3 5B model, modify the gpt3/5b training configuration in conf/training/gpt3/5b.yaml as follows:
## Transformer Engine
transformer_engine: True # turn on Transformer Engine
fp8: True # enables fp8 in TransformerLayer forward
fp8_e4m3: False # sets fp8_format = recipe.Format.E4M3
fp8_hybrid: True # sets fp8_format = recipe.Format.HYBRID
fp8_margin: 0 # scaling margin
fp8_interval: 1 # scaling update interval
fp8_amax_history_len: 32 # Number of steps for which amax history is recorded per tensor
fp8_amax_compute_algo: max # 'most_recent' or 'max'. Algorithm for computing amax from history
use_emha: False
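The same switches can also be flipped at launch time with Hydra dot-notation overrides, without editing the YAML file (a sketch, assuming the keys sit under the model section of the training configuration as shown above):

python3 main.py training=gpt3/5b stages=[training] \
    training.model.transformer_engine=True \
    training.model.fp8=True training.model.fp8_hybrid=True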
NVIDIA has compared fp8 and bf16 precision, and has observed similar convergence behavior but a significant speed-up with fp8.
NVIDIA provides configurations for five T5 model sizes: 220M, 3B, 11B, 23B, and 41B. These configurations include carefully selected hyperparameters, which you may use as guidelines for any custom model configurations.
The configurations are defined by configuration files in the directory conf/training/t5. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.
For Base Command Platform, you must launch all jobs in multi-node mode.
220M configuration
The 220M model uses the bf16 data type. It can be trained in about 3.5 days using 4 nodes with 8 GPUs per node. The model includes 12 transformer layers, a hidden size of 768, a feedforward network size of 2048, and 12 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file t5/220m.yaml for parameter details.
To train a 220M model on a Slurm cluster, modify the conf/config.yaml file to set:
training: t5/220m
stages:
- training
Then enter:
python3 main.py
To train a 220M model on a Base Command Platform cluster with 4 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/220m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
To train with a different number of nodes, you can change the relevant parameters (e.g. micro_batch_size) either in the appropriate YAML configuration file or from the command line. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, all jobs must be launched in multi-node mode.
3B configuration
The 3B model uses the bf16 data type. It can be trained in about 7.5 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, a feedforward network size of 5120, and 32 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on the parameters, see the configuration file t5/3b.yaml.
To train a 3B model, modify the conf/config.yaml file to set:
training: t5/3b
stages:
- training
Then enter:
python3 main.py
To train a 3B model on a Base Command Platform cluster with 20 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/3b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
11B configuration
The 11B model uses the bf16 data type. It can be trained in about 26.5 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 4096, a feedforward network size of 10240, and 64 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4. For details on the parameters, see the configuration file t5/11b.yaml.
To train an 11B model, modify the conf/config.yaml file to set:
training: t5/11b
stages:
- training
Then enter:
python3 main.py
To train an 11B model on a Base Command Platform cluster with 20 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/11b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
23B configuration
The 23B model uses the bf16 data type. It can be trained in about 36 days using 40 nodes with 8 GPUs per node. The model includes 36 transformer layers, a hidden size of 5120, a feedforward network size of 10880, and 64 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 2. For details on the parameters, see the configuration file t5/23b.yaml.
To train a 23B model, modify the conf/config.yaml file to set:
training: t5/23b
stages:
- training
Then enter:
python3 main.py
To train a 23B model on a Base Command Platform cluster with 40 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/23b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
41B configuration
The 41B model uses the bf16 data type. It can be trained in about 60 days using 40 nodes with 8 GPUs per node. The model includes 36 transformer layers, a hidden size of 6144, a feedforward network size of 10880, and 96 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 2. For details on the parameters, see the configuration file t5/41b.yaml.
To train a 41B model, modify the conf/config.yaml file to set:
training: t5/41b
stages:
- training
Then enter:
python3 main.py
To train a 41B model on a Base Command Platform cluster with 40 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/41b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
NVIDIA provides configurations for three mT5 model sizes: 170M, 390M, and 3B. These configurations include carefully selected hyperparameters, which you may use as guidelines for any custom model configurations.
The configurations are defined by configuration files in the directory conf/training/mt5. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.
For Base Command Platform, you must launch all jobs in multi-node mode.
170M configuration
The 170M model uses the bf16 data type. It can be trained in about 4 days using 4 nodes with 8 GPUs per node. The model includes 8 transformer layers, a hidden size of 512, a feedforward network size of 1024, and 6 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file mt5/170m.yaml for parameter details.
To train a 170M model on a Slurm cluster, modify the conf/config.yaml file to set:
training: mt5/170m
stages:
- training
Then enter:
python3 main.py
To train a 170M model on a Base Command Platform cluster with 4 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=mt5/170m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
To train with a different number of nodes, you can change the relevant parameters (e.g. micro_batch_size) either in the appropriate YAML configuration file or from the command line. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, you must launch all jobs in multi-node mode.
390M configuration
The 390M model uses the bf16 data type. It can be trained in about 4 days using 8 nodes with 8 GPUs per node. The model includes 8 transformer layers, a hidden size of 512, a feedforward network size of 2048, and 12 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file mt5/390m.yaml for parameter details.
To train a 390M model on a Slurm cluster, modify the conf/config.yaml file to set:
training: mt5/390m
stages:
- training
Then enter:
python3 main.py
To train a 390M model on a Base Command Platform cluster with 8 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=mt5/390m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
3B configuration
The 3B model uses the bf16 data type. It can be trained in about 14 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, a feedforward network size of 5120, and 32 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 2. For details on the parameters, see the configuration file mt5/3b.yaml.
To train a 3B model, modify the conf/config.yaml file to set:
training: mt5/3b
stages:
- training
Then enter:
python3 main.py
To train a 3B model on a Base Command Platform cluster with 20 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=mt5/3b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
The training code can log the model- and system-related metrics to both TensorBoard and Weights & Biases (W&B). The local files are stored in the directory specified in the training.exp_manager.explicit_log_dir parameter. TensorBoard logs are saved by default.
However, the W&B API key must be specified for W&B to work properly. To upload the logs to W&B, you must first store the API key in the first (normally the only) line of a text file and set the wandb_api_key_file parameter to the file's pathname. For Base Command Platform, you can store this file in a dataset or workspace mounted for the job.
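For example, on Base Command Platform you could write the key into the mounted results workspace and point the launcher at it (the file name here is illustrative):

echo "<your W&B API key>" > /mount/results/wandb_api_key.txt

Then add wandb_api_key_file=/mount/results/wandb_api_key.txt to the main.py command line.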
You must set the following training configurations to enable logging of training metrics to W&B:
exp_manager:
  create_wandb_logger: True
  wandb_logger_kwargs:
    project: [W&B project name]
    name: [W&B run name]
The logs show the reduced_train_loss, val_loss, and train_step_timing metrics, among others. train_step_timing measures the time it takes to finish each global step.
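train_step_timing also gives a quick throughput estimate: tokens per second ≈ global_batch_size × sequence_length / train_step_timing. For example, assuming a global batch size of 256, a sequence length of 2048, and a step time of 1.0 s, throughput is about 256 × 2048 / 1.0 ≈ 524,000 tokens/s; these numbers are illustrative, not values from any specific configuration.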
NVIDIA provides configurations for four BERT model sizes: 110M, 4B, 20B, and 100B. These configurations include carefully selected hyperparameters, which you may use as guidelines for any custom model configurations.
The configurations are defined by configuration files in the directory conf/training/bert. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.
For Base Command Platform, you must launch all jobs in multi-node mode.
110M configuration
The 110M model uses the bf16 data type. The model includes 12 transformer layers, a hidden size of 768, a feedforward network size of 3072, and 12 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file bert/110m.yaml for parameter details.
To train a 110M model on a Slurm cluster, modify the conf/config.yaml file to set:
training: bert/110m
stages:
- training
Then enter:
python3 main.py
To train a 110M model on a Base Command Platform cluster with 4 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=bert/110m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_bert \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
To train with a different number of nodes, you can adjust the relevant parameters (e.g. micro_batch_size) either in the appropriate YAML configuration file or from the command line. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, all jobs must be launched in multi-node mode.
4B configuration
The 4B model uses the bf16 data type. The model includes 48 transformer layers, a hidden size of 2560, a feedforward network size of 10240, and 40 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on the parameters, see the configuration file bert/4b.yaml.
To train a 4B model, modify the conf/config.yaml file to set:
training: bert/4b
stages:
- training
Then enter:
python3 main.py
To train a 4B model on a Base Command Platform cluster with 20 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=bert/4b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_bert \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
20B configuration
The 20B model uses the bf16 data type. The model includes 48 transformer layers, a hidden size of 6144, a feedforward network size of 24576, and 96 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on the parameters, see the configuration file bert/20b.yaml.
To train a 20B model, modify the conf/config.yaml file to set:
training: bert/20b
stages:
- training
Then enter:
python3 main.py
To train a 20B model on a Base Command Platform cluster with 20 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=bert/20b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_bert \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
100B configuration
The 100B model uses the bf16 data type. The model includes 96 transformer layers, a hidden size of 9216, a feedforward network size of 36864, and 96 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on the parameters, see the configuration file bert/100b.yaml.
To train a 100B model, modify the conf/config.yaml file to set:
training: bert/100b
stages:
- training
Then enter:
python3 main.py
To train a 100B model on a Base Command Platform cluster with 20 nodes, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=bert/100b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_bert \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).