Important

NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

Training with Predefined Configurations

NVIDIA provides configurations for five base GPT model sizes (126M, 5B, 20B, 40B, and 175B), along with the improved variants 400M_improved, 1B_improved, 7B_improved, and 40B_improved. These configurations include carefully selected hyperparameters, which you may use as guidelines for any custom model configurations.

The configurations are defined by configuration files in the directory conf/training/gpt3. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.

For Base Command Platform, you must launch all jobs in multi-node mode.

For Kubernetes clusters, the cluster and cluster_type settings in conf/config.yaml need to be set to k8s.

126M configuration

The 126M model uses the bf16 data type. It can be trained in about 20 hours using 8 nodes with 8 GPUs per node. The model includes 12 transformer layers, a hidden size of 768, and 12 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file gpt3/126m.yaml for parameter details.
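As a rough cross-check of the figures above, the parameter count of a GPT-style decoder can be estimated from the layer count and hidden size. The sketch below uses the common approximation of 12·h² weights per transformer layer plus token and position embeddings. The vocabulary size of 50257 (the GPT-2 BPE vocabulary) and the 4x MLP expansion are assumptions, not values taken from this page, and biases and layer norms are ignored.

```python
def estimate_gpt_params(num_layers, hidden_size, seq_length=2048, vocab_size=50257):
    """Approximate parameter count of a GPT-style decoder.

    Assumptions (not from the configuration file): vocab_size=50257 and an
    MLP that expands h -> 4h -> h. Biases and layer-norm weights are ignored.
    """
    # Per layer: 4*h^2 for attention (Q, K, V, output) + 8*h^2 for the MLP.
    per_layer = 12 * hidden_size ** 2
    # Token embeddings plus learned position embeddings.
    embeddings = (vocab_size + seq_length) * hidden_size
    return num_layers * per_layer + embeddings


# 126M configuration: 12 layers, hidden size 768, sequence length 2048.
print(f"{estimate_gpt_params(12, 768) / 1e6:.0f}M")  # prints "125M"
```

The estimate lands close to the nominal model size. The same function can be applied to the layer counts and hidden sizes of the other configurations, with the same caveats.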

To train a 126M model on a Slurm or Kubernetes cluster, modify the conf/config.yaml file to set:

- training: gpt3/126m
stages:
  - training

Then enter:

python3 main.py

To train a 126M GPT model on a Base Command Platform cluster with 8 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/126m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when the job was created.

To train with a different number of nodes, adjust the relevant parameters either in the YAML configuration file or on the command line. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, all jobs must be launched in multi-node mode.
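To illustrate adjusting parameters from the command line, the following minimal sketch assembles the Base Command Platform command for a chosen model and node count. The helper name build_bcp_command and its arguments are hypothetical. The paths and override keys are the ones used in the command above, and stages is written in Hydra list syntax. Only num_nodes is changed here, and other parameters may also need adjusting for a different node count.

```python
import shlex

# Launcher location used in the commands on this page.
LAUNCHER = "/opt/NeMo-Framework-Launcher/launcher_scripts"


def build_bcp_command(model, num_nodes, data_root="/mount/data", results_dir="/mount/results"):
    """Return the launcher command as an argument list (hypothetical helper)."""
    overrides = [
        f"training=gpt3/{model}",
        "stages=[training]",
        f"launcher_scripts_path={LAUNCHER}",
        f"data_dir={data_root}/the_pile_gpt3",
        f"base_results_dir={results_dir}",
        f"training.trainer.num_nodes={num_nodes}",
        f"training.model.tokenizer.vocab_file={data_root}/bpe/vocab.json",
        f"training.model.tokenizer.merge_file={data_root}/bpe/merges.txt",
        "cluster_type=bcp",
    ]
    return ["python3", f"{LAUNCHER}/main.py", *overrides]


# Print a shell-safe command line for the 126M model on 4 nodes.
print(shlex.join(build_bcp_command("126m", 4)))
```

Building the command as a list and quoting it with shlex.join keeps characters such as the square brackets in stages=[training] from being interpreted by the shell.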

400M_improved configuration

The 400M_improved model uses the bf16 data type. It can be trained in about 2 days using 8 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 1024, and 16 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file gpt3/400m_improved.yaml for parameter details.

To train a 400M_improved model on a Slurm or Kubernetes cluster, modify the conf/config.yaml file to set:

- training: gpt3/400m_improved
stages:
  - training

Then enter:

python3 main.py

To train a 400M_improved GPT model on a Base Command Platform cluster with 8 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/400m_improved \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when the job was created.

1B_improved configuration

The 1B_improved model uses the bf16 data type. It can be trained in about 3 days using 8 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, and 16 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file gpt3/1b_improved.yaml for parameter details.

To train a 1B_improved model on a Slurm or Kubernetes cluster, modify the conf/config.yaml file to set:

- training: gpt3/1b_improved
stages:
  - training

Then enter:

python3 main.py

To train a 1B_improved GPT model on a Base Command Platform cluster with 8 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/1b_improved \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when the job was created.

5B configuration

The 5B model uses the bf16 data type. It can be trained in about 5 days using 16 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 4096, and 32 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 1. For details on the parameters, see the configuration file 5b.yaml.

To train a 5B GPT model, modify the conf/config.yaml file to set:

- training: gpt3/5b
stages:
  - training

Then enter:

python3 main.py

To train a 5B GPT model on a Base Command Platform cluster with 16 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/5b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when the job was created.

7B_improved configuration

The 7B_improved model uses the bf16 data type. It can be trained in about 6 days using 16 nodes with 8 GPUs per node. The model includes 32 transformer layers, a hidden size of 4096, and 32 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 2. For details on the parameters, see the configuration file 7b_improved.yaml.

To train a 7B_improved GPT model, modify the conf/config.yaml file to set:

- training: gpt3/7b_improved
stages:
  - training

Then enter:

python3 main.py

To train a 7B_improved GPT model on a Base Command Platform cluster with 16 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/7b_improved \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when the job was created.

20B configuration

The 20B model uses the bf16 data type. It can be trained in about 6 days using 64 nodes with 8 GPUs per node. The model includes 44 transformer layers, a hidden size of 6144, and 48 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 1. For details on the parameters, see the configuration file 20b.yaml.

To train a 20B GPT model, modify the conf/config.yaml file to set:

- training: gpt3/20b
stages:
  - training

Then enter:

python3 main.py

To train a 20B GPT model on a Base Command Platform cluster with 64 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/20b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when the job was created.

40B configuration

The 40B model uses the bf16 data type. It can be trained in about 6 days using 128 nodes with 8 GPUs per node. The model includes 48 transformer layers, a hidden size of 8192, and 64 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 1. For details on the parameters, see the configuration file 40b.yaml.

To train a 40B GPT model, modify the conf/config.yaml file to set:

- training: gpt3/40b
stages:
  - training

Then enter:

python3 main.py

To train a 40B GPT model on a Base Command Platform cluster with 128 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/40b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when the job was created.

40B_improved configuration

The 40B_improved model uses the bf16 data type. It can be trained in about 6 days using 96 nodes with 8 GPUs per node. The model includes 48 transformer layers, a hidden size of 8192, and 64 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 2. For details on the parameters, see the configuration file 40b_improved.yaml.

To train a 40B_improved GPT model, modify the conf/config.yaml file to set:

- training: gpt3/40b_improved
stages:
  - training

Then enter:

python3 main.py

To train a 40B_improved GPT model on a Base Command Platform cluster with 96 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/40b_improved \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when the job was created.

175B configuration

The 175B model uses the bf16 data type. It can be trained in about 24 days using 128 nodes with 8 GPUs per node. The model includes 96 transformer layers, a hidden size of 12288, and 96 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 16. This model uses interleaved pipeline scheduling, with a virtual pipeline chunk size of 6. For details on the parameters, see the configuration file 175b.yaml.
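The relationship between node count and the two model-parallel degrees can be made concrete with a little arithmetic. The sketch below is illustrative only and not part of the launcher. It assumes the usual convention that the data-parallel size equals the total GPU count divided by the product of the tensor-parallel and pipeline-parallel sizes, and that layers are split evenly across pipeline stages. The function name parallel_layout is hypothetical.

```python
def parallel_layout(num_nodes, gpus_per_node, tensor_parallel, pipeline_parallel, num_layers):
    """Derive the data-parallel size and layers per pipeline stage (illustrative)."""
    world_size = num_nodes * gpus_per_node
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError("GPU count must be divisible by tensor_parallel * pipeline_parallel")
    if num_layers % pipeline_parallel != 0:
        raise ValueError("layer count must be divisible by pipeline_parallel")
    return {
        "world_size": world_size,
        "data_parallel": world_size // gpus_per_replica,
        "layers_per_pipeline_stage": num_layers // pipeline_parallel,
    }


# 175B configuration: 128 nodes x 8 GPUs, tensor parallelism 8, pipeline parallelism 16, 96 layers.
print(parallel_layout(128, 8, 8, 16, 96))
# {'world_size': 1024, 'data_parallel': 8, 'layers_per_pipeline_stage': 6}
```

Under these assumptions, each model replica spans 128 GPUs, so the 1024 GPUs hold 8 data-parallel replicas.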

To train a 175B GPT model, modify the conf/config.yaml file to set:

- training: gpt3/175b
stages:
  - training

Then enter:

python3 main.py

To train a 175B GPT model on a Base Command Platform cluster with 128 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=gpt3/175b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results, and that $NGC_ARRAY_SIZE is set to the number of nodes (replicas) selected when the job was created.

FP8 with Transformer Engine

NVIDIA Transformer Engine (TE) is an open-source library for accelerating Transformer-based models on NVIDIA Hopper GPUs. It enables 8-bit floating point (FP8) precision, which provides better performance with lower memory utilization in both training and inference. TE is available on GitHub.

You can use the fp8 data type in the NeMo Framework to pre-train GPT models. For example, to turn on fp8 when pre-training a GPT3 5B model, modify the gpt3/5b training configuration in conf/training/gpt3/5b.yaml as follows:

## Transformer Engine
transformer_engine: True # turn on Transformer Engine
fp8: True # enables fp8 in TransformerLayer forward
fp8_e4m3: False # sets fp8_format = recipe.Format.E4M3
fp8_hybrid: True # sets fp8_format = recipe.Format.HYBRID
fp8_margin: 0 # scaling margin
fp8_interval: 1 # scaling update interval
fp8_amax_history_len: 32 # Number of steps for which amax history is recorded per tensor
fp8_amax_compute_algo: max # 'most_recent' or 'max'. Algorithm for computing amax from history
use_emha: False
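To give a feel for what the margin, amax history, and amax algorithm settings control, here is a simplified sketch of delayed scaling. It is not Transformer Engine code. It assumes the scaling factor is computed as FP8_MAX / amax / 2^margin, where amax is chosen from a rolling history of per-tensor absolute maxima, and it uses the largest finite values of the two FP8 formats (448 for E4M3, 57344 for E5M2). The HYBRID format uses E4M3 in the forward pass and E5M2 for gradients in the backward pass. The function name compute_scale is hypothetical.

```python
from collections import deque

# Largest finite value representable in each FP8 format.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


def compute_scale(amax_history, fp8_format="E4M3", margin=0, algo="max"):
    """Simplified delayed-scaling factor (illustrative, not the TE implementation)."""
    if algo == "max":
        amax = max(amax_history)   # largest amax seen in the history window
    elif algo == "most_recent":
        amax = amax_history[-1]    # newest entry (appended last in this sketch)
    else:
        raise ValueError(f"unknown algo: {algo}")
    if amax <= 0:
        return 1.0                 # nothing observed yet; leave values unscaled
    return FP8_MAX[fp8_format] / amax / (2 ** margin)


# A history window of 32 steps, matching fp8_amax_history_len above.
history = deque(maxlen=32)
for step_amax in (1.0, 4.0, 2.0):
    history.append(step_amax)

print(compute_scale(history, algo="max"))          # 112.0
print(compute_scale(history, algo="most_recent"))  # 224.0
```

In this sketch, the max algorithm yields a smaller scale because it accounts for the largest recent value, while a larger margin leaves more headroom before overflow at the cost of precision.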

NVIDIA has compared fp8 and bf16 precision and observed similar convergence behavior, with a significant speed-up for fp8.