Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Training with Predefined Configurations
NVIDIA provides configurations for five GPT model sizes: 126M, 5B, 20B, 40B, and 175B, along with the improved variants described below (400M_improved, 1B_improved, 7B_improved, and 40B_improved). These configurations include carefully selected hyperparameters, which you can use as guidelines for custom model configurations.
The configurations are defined by configuration files in the directory conf/training/gpt3. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.
For Base Command Platform, you must launch all jobs in multi-node mode.
For Kubernetes clusters, the cluster and cluster_type settings in conf/config.yaml must be set to k8s.
126M configuration
The 126M model uses the bf16 data type. It can be trained in about 20 hours using 8 nodes with 8 GPUs per node. The model includes 12 transformer layers, a hidden size of 768, and 12 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file gpt3/126m.yaml for parameter details.
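As a rough sanity check, the layer count and hidden size quoted for a configuration can be turned into an approximate parameter count. The sketch below uses the common estimate of 12 * layers * hidden^2 for the transformer blocks plus token and position embeddings. The vocabulary size of 50257 (the GPT-2 BPE vocabulary) and the use of learned position embeddings are assumptions, not values taken from the configuration files, and the function name is hypothetical.

```python
def approx_gpt_params(num_layers: int, hidden_size: int,
                      seq_length: int = 2048, vocab_size: int = 50257) -> int:
    """Estimate the parameter count of a GPT-style model.

    vocab_size=50257 is an assumed GPT-2 BPE vocabulary size; biases and
    layer-norm weights are ignored because they contribute well under 1%.
    """
    # Each block: 4*h^2 for attention (QKV + output projection)
    # plus 8*h^2 for the MLP with a 4x expansion.
    blocks = 12 * num_layers * hidden_size ** 2
    token_embeddings = vocab_size * hidden_size
    position_embeddings = seq_length * hidden_size
    return blocks + token_embeddings + position_embeddings


if __name__ == "__main__":
    # 126M configuration: 12 layers, hidden size 768 (roughly 125M parameters).
    print(f"126M config: {approx_gpt_params(12, 768) / 1e6:.1f}M parameters")
    # 175B configuration: 96 layers, hidden size 12288.
    print(f"175B config: {approx_gpt_params(96, 12288) / 1e9:.1f}B parameters")
```

The estimate lands close to the nominal model names, which suggests the quoted layer counts and hidden sizes are the dominant factors in model size.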
To train a 126M model on a Slurm or Kubernetes cluster, modify the conf/config.yaml file to set:
- training: gpt3/126m
stages:
- training
Then enter:
python3 main.py
To train a 126M GPT model on a Base Command Platform cluster with 8 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/126m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when creating the job.
To train with a different number of nodes, adjust the relevant parameters either in the YAML configuration file or from the command line. See Resuming Training with a Different Number of Nodes for more information.
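When the node count changes, the data-parallel size changes with it. In Megatron-style training the global batch size must remain a multiple of the micro-batch size times the data-parallel size, and the quotient is the number of gradient accumulation steps. The following sketch illustrates that arithmetic for a model without model parallelism, such as the 126M configuration. The batch sizes used here are hypothetical illustration values, not values read from gpt3/126m.yaml, and the function name is an assumption.

```python
def grad_accumulation_steps(num_nodes: int, global_batch_size: int,
                            micro_batch_size: int, gpus_per_node: int = 8) -> int:
    """Return the gradient accumulation steps for a model with no model
    parallelism, where every GPU holds one data-parallel replica."""
    data_parallel_size = num_nodes * gpus_per_node
    samples_per_step = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"micro batch size x data-parallel size = {samples_per_step}"
        )
    return global_batch_size // samples_per_step


if __name__ == "__main__":
    # Hypothetical batch sizes, chosen only to illustrate the constraint.
    for nodes in (8, 4, 2):
        steps = grad_accumulation_steps(nodes, global_batch_size=256,
                                        micro_batch_size=4)
        print(f"{nodes} nodes -> {steps} accumulation step(s)")
```

Halving the node count doubles the accumulation steps, so the effective batch size stays the same while each optimizer step takes longer.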
400M_improved configuration
The 400M_improved model uses the bf16 data type. It can be trained in about 2 days using 8 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 1024, and 16 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file gpt3/400m_improved.yaml for parameter details.
To train a 400M_improved model on a Slurm or Kubernetes cluster, modify the conf/config.yaml file to set:
- training: gpt3/400m_improved
stages:
- training
Then enter:
python3 main.py
To train a 400M_improved GPT model on a Base Command Platform cluster with 8 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/400m_improved \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when creating the job.
1B_improved configuration
The 1B_improved model uses the bf16 data type. It can be trained in about 3 days using 8 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, and 16 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file gpt3/1b_improved.yaml for parameter details.
To train a 1B_improved model on a Slurm or Kubernetes cluster, modify the conf/config.yaml file to set:
- training: gpt3/1b_improved
stages:
- training
Then enter:
python3 main.py
To train a 1B_improved GPT model on a Base Command Platform cluster with 8 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/1b_improved \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when creating the job.
5B configuration
The 5B model uses the bf16 data type. It can be trained in about 5 days using 16 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 4096, and 32 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 1. For details on the parameters, see the configuration file 5b.yaml.
To train a 5B GPT model, modify the conf/config.yaml file to set:
- training: gpt3/5b
stages:
- training
Then enter:
python3 main.py
To train a 5B GPT model on a Base Command Platform cluster with 16 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/5b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when creating the job.
7B_improved configuration
The 7B_improved model uses the bf16 data type. It can be trained in about 6 days using 16 nodes with 8 GPUs per node. The model includes 32 transformer layers, a hidden size of 4096, and 32 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 2. For details on the parameters, see the configuration file 7b_improved.yaml.
To train a 7B_improved GPT model, modify the conf/config.yaml file to set:
- training: gpt3/7b_improved
stages:
- training
Then enter:
python3 main.py
To train a 7B_improved GPT model on a Base Command Platform cluster with 16 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/7b_improved \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when creating the job.
20B configuration
The 20B model uses the bf16 data type. It can be trained in about 6 days using 64 nodes with 8 GPUs per node. The model includes 44 transformer layers, a hidden size of 6144, and 48 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 1. For details on the parameters, see the configuration file 20b.yaml.
To train a 20B GPT model, modify the conf/config.yaml file to set:
- training: gpt3/20b
stages:
- training
Then enter:
python3 main.py
To train a 20B GPT model on a Base Command Platform cluster with 64 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/20b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when creating the job.
40B configuration
The 40B model uses the bf16 data type. It can be trained in about 6 days using 128 nodes with 8 GPUs per node. The model includes 48 transformer layers, a hidden size of 8192, and 64 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 1. For details on the parameters, see the configuration file 40b.yaml.
To train a 40B GPT model, modify the conf/config.yaml file to set:
- training: gpt3/40b
stages:
- training
Then enter:
python3 main.py
To train a 40B GPT model on a Base Command Platform cluster with 128 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/40b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when creating the job.
40B_improved configuration
The 40B_improved model uses the bf16 data type. It can be trained in about 6 days using 96 nodes with 8 GPUs per node. The model includes 48 transformer layers, a hidden size of 8192, and 64 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 2. For details on the parameters, see the configuration file 40b_improved.yaml.
To train a 40B_improved GPT model, modify the conf/config.yaml file to set:
- training: gpt3/40b_improved
stages:
- training
Then enter:
python3 main.py
To train a 40B_improved GPT model on a Base Command Platform cluster with 96 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/40b_improved \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when creating the job.
175B configuration
The 175B model uses the bf16 data type. It can be trained in about 24 days using 128 nodes with 8 GPUs per node. The model includes 96 transformer layers, a hidden size of 12288, and 96 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 16. It also uses interleaved pipeline scheduling, with a virtual pipeline chunk size of 6. For details on the parameters, see the configuration file 175b.yaml.
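The tensor and pipeline parallelism values quoted for each configuration determine how many GPUs hold one copy of the model, and therefore how many data-parallel replicas fit on the stated node count. The sketch below works through that arithmetic for the model-parallel configurations described above. Pipeline parallelism is assumed to be 1 wherever the text does not state it (5B and 7B_improved), and the function name is hypothetical.

```python
GPUS_PER_NODE = 8

# (config name, nodes, tensor parallelism, pipeline parallelism), taken from
# the descriptions above. Pipeline parallelism of 1 is an assumption for the
# 5b and 7b_improved entries, where the text does not mention it.
CONFIGS = [
    ("5b", 16, 1, 1),
    ("7b_improved", 16, 2, 1),
    ("20b", 64, 4, 1),
    ("40b", 128, 8, 1),
    ("40b_improved", 96, 8, 2),
    ("175b", 128, 8, 16),
]


def data_parallel_size(nodes: int, tensor_parallel: int, pipeline_parallel: int,
                       gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Return how many model replicas fit on the cluster."""
    world_size = nodes * gpus_per_node
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError("world size must be divisible by TP x PP")
    return world_size // gpus_per_replica


if __name__ == "__main__":
    for name, nodes, tp, pp in CONFIGS:
        dp = data_parallel_size(nodes, tp, pp)
        print(f"{name:>13}: {nodes * GPUS_PER_NODE:4d} GPUs, "
              f"{tp * pp:3d} GPUs per replica, data-parallel size {dp}")
```

For the 175B configuration this gives 1024 GPUs, 128 GPUs per model replica, and a data-parallel size of 8.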
To train a 175B GPT model, modify the conf/config.yaml file to set:
- training: gpt3/175b
stages:
- training
Then enter:
python3 main.py
To train a 175B GPT model on a Base Command Platform cluster with 128 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/175b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when creating the job.
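The Base Command Platform commands in the sections above differ only in the value passed to training=. As a convenience, a small helper could assemble the command for any configuration name. This is only a sketch: the function name is hypothetical, the paths are the ones assumed by the commands above, and stages=[training] is written in Hydra's list-override syntax, which is an assumption about how the launcher expects the stage list.

```python
import shlex

LAUNCHER_DIR = "/opt/NeMo-Framework-Launcher/launcher_scripts"


def bcp_training_command(config_name: str) -> str:
    """Build the launcher command line for a gpt3/<config_name> training run."""
    args = [
        "python3",
        f"{LAUNCHER_DIR}/main.py",
        f"training=gpt3/{config_name}",
        "stages=[training]",
        f"launcher_scripts_path={LAUNCHER_DIR}",
        "data_dir=/mount/data/the_pile_gpt3",
        "base_results_dir=/mount/results",
        # Kept as a literal so the shell does not expand it at launch time,
        # matching the escaped \$NGC_ARRAY_SIZE in the commands above.
        "training.trainer.num_nodes=$NGC_ARRAY_SIZE",
        "training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json",
        "training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt",
        "cluster_type=bcp",
    ]
    # shlex.join single-quotes any argument containing shell metacharacters.
    return shlex.join(args)


if __name__ == "__main__":
    print(bcp_training_command("5b"))
```

Because shlex.join quotes each argument, the brackets and the dollar sign reach the launcher unchanged when the printed command is pasted into a shell.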
FP8 with Transformer Engine
NVIDIA Transformer Engine (TE) is an open-source library for accelerating Transformer-based models on NVIDIA Hopper GPUs. It enables 8-bit floating point (FP8) precision, which provides better performance with lower memory utilization in both training and inference. TE is available on GitHub.
You can use the fp8 data type in the NeMo Framework to pre-train GPT models. For example, to turn on fp8 when pre-training a GPT3 5B model, modify the gpt3/5b training configuration in conf/training/gpt3/5b.yaml as follows:
## Transformer Engine
transformer_engine: True # turn on Transformer Engine
fp8: True # enables fp8 in TransformerLayer forward
fp8_e4m3: False # sets fp8_format = recipe.Format.E4M3
fp8_hybrid: True # sets fp8_format = recipe.Format.HYBRID
fp8_margin: 0 # scaling margin
fp8_interval: 1 # scaling update interval
fp8_amax_history_len: 32 # Number of steps for which amax history is recorded per tensor
fp8_amax_compute_algo: max # 'most_recent' or 'max'. Algorithm for computing amax from history
use_emha: False
NVIDIA has compared fp8 and bf16 precision and has observed similar convergence behavior but a significant speed-up with fp8.
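To give a feel for what fp8_margin, fp8_amax_history_len, and fp8_amax_compute_algo control, the following plain-Python sketch mimics delayed scaling: a per-tensor history of absolute maxima (amax) is reduced to one value, and a scaling factor is derived so that the tensor fits the FP8 range. This is a simplified illustration, not Transformer Engine's implementation; the function names are hypothetical. The maximum representable values (448 for E4M3, 57344 for E5M2) are properties of the FP8 formats, and the hybrid format is understood to use E4M3 in the forward pass and E5M2 for gradients.

```python
from collections import deque

# Largest finite magnitudes of the two FP8 formats.
FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}


def amax_from_history(history, algo: str = "max") -> float:
    """Reduce the recorded amax history to one value ('max' or 'most_recent')."""
    values = list(history)
    if algo == "max":
        return max(values)
    if algo == "most_recent":
        return values[-1]
    raise ValueError(f"unknown amax algorithm: {algo}")


def scaling_factor(amax: float, fp8_max: float, margin: int = 0) -> float:
    """Scale so that amax maps to fp8_max, backed off by 2**margin."""
    if amax <= 0.0:
        return 1.0  # nothing recorded yet; leave the tensor unscaled
    return (fp8_max / amax) / (2 ** margin)


if __name__ == "__main__":
    # A bounded history, analogous to fp8_amax_history_len: 32.
    history = deque(maxlen=32)
    for observed_amax in (0.5, 2.0, 1.0):  # made-up values for illustration
        history.append(observed_amax)

    amax = amax_from_history(history, algo="max")
    print("forward (E4M3) scale:", scaling_factor(amax, FP8_MAX["e4m3"], margin=0))
    print("backward (E5M2) scale:", scaling_factor(amax, FP8_MAX["e5m2"], margin=0))
```

A larger margin leaves more headroom against overflow at the cost of precision, while the 'max' algorithm is more conservative than 'most_recent' because a single large value keeps the scale low for the whole history window.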