Training with Predefined Configurations

NVIDIA provides nine predefined GPT configurations: 126M, 400M_improved, 1B_improved, 5B, 7B_improved, 20B, 40B, 40B_improved, and 175B. These configurations include carefully selected hyperparameters, which you may use as guidelines for any custom model configurations.

The configurations are defined by configuration files in the directory conf/training/gpt3. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.

For Base Command Platform, you must launch all jobs in multi-node mode.

For Kubernetes clusters, the cluster and cluster_type settings in conf/config.yaml need to be set to k8s.

126M configuration

The 126M model uses the bf16 data type. It can be trained in about 20 hours using 8 nodes with 8 GPUs per node. The model includes 12 transformer layers, a hidden size of 768, and 12 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file gpt3/126m.yaml for parameter details.
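
As a rough cross-check of the 126M figure, the parameter count can be estimated from the layer count and hidden size quoted above. The sketch below is illustrative only: the function name `estimate_gpt_params`, the 50257-token vocabulary (typical of GPT-2 style BPE files), and the 4x feed-forward expansion are assumptions, not values read from gpt3/126m.yaml.

```python
def estimate_gpt_params(num_layers: int, hidden_size: int,
                        vocab_size: int = 50257, seq_length: int = 2048) -> int:
    """Rough parameter count for a GPT-style decoder (biases and layer norms ignored)."""
    # Per layer: about 4*h^2 weights for attention (Q, K, V, output projection)
    # plus about 8*h^2 for an MLP that expands h -> 4h -> h (assumed ratio).
    per_layer = 12 * hidden_size ** 2
    # Token embeddings plus learned position embeddings (assumed vocabulary size).
    embeddings = (vocab_size + seq_length) * hidden_size
    return num_layers * per_layer + embeddings


# Values from the 126M description: 12 layers, hidden size 768, sequence length 2048.
params_126m = estimate_gpt_params(num_layers=12, hidden_size=768)
print(f"{params_126m / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes to roughly 125M parameters, which is in line with the configuration name.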

To train a 126M model on a Slurm or Kubernetes cluster, modify the conf/config.yaml file to set:

    - training: gpt3/126m
    stages:
      - training

Then enter:

    python3 main.py

To train a 126M GPT model on a Base Command Platform cluster with 8 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/126m \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
    training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
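
Each `key=value` argument in the command overrides one entry of the launcher configuration, with dots separating nesting levels. The following sketch mimics that idea with plain dictionaries. It is a simplified illustration, not the launcher's actual parsing, and `apply_override` is a hypothetical helper.

```python
def apply_override(config: dict, override: str) -> None:
    """Set a nested value from a 'dotted.key=value' string (simplified, values stay strings)."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down the nesting levels, creating them when they are missing.
        node = node.setdefault(name, {})
    node[leaf] = value


# A tiny stand-in for the loaded configuration (not the real file contents).
config = {"training": {"trainer": {"num_nodes": "1"}}}

for item in [
    "training.trainer.num_nodes=8",
    "training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json",
    "cluster_type=bcp",
]:
    apply_override(config, item)

print(config["training"]["trainer"]["num_nodes"])
print(config["cluster_type"])
```

The same dotted keys correspond to the nested entries you would otherwise edit by hand in the YAML files.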

To train with a different number of nodes, adjust the relevant parameters either in the YAML configuration file or from the command line. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, you must launch all jobs in multi-node mode.
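
When the node count changes, the data-parallel size changes with it, and the global batch size generally has to remain divisible by the micro batch size times the data-parallel size. The sketch below checks that condition. The batch sizes are hypothetical placeholders (read the real ones from the YAML file), and both function names are illustrative.

```python
def data_parallel_size(num_nodes: int, gpus_per_node: int = 8,
                       tensor_parallel: int = 1, pipeline_parallel: int = 1) -> int:
    """Number of model replicas that train in parallel."""
    world_size = num_nodes * gpus_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("GPU count must be a multiple of tensor * pipeline parallelism")
    return world_size // model_parallel


def batch_is_valid(global_batch_size: int, micro_batch_size: int, dp_size: int) -> bool:
    """True when the global batch splits evenly into micro batches across replicas."""
    return global_batch_size % (micro_batch_size * dp_size) == 0


# Hypothetical batch sizes, chosen only for illustration.
global_batch, micro_batch = 256, 4

results = {}
for nodes in (8, 4, 3):
    dp = data_parallel_size(nodes)
    results[nodes] = batch_is_valid(global_batch, micro_batch, dp)
    print(f"{nodes} nodes -> data parallel {dp}, valid: {results[nodes]}")
```

In this example 8 and 4 nodes pass the check, while 3 nodes would require a different micro batch or global batch size.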

400M_improved configuration

The 400M_improved model uses the bf16 data type. It can be trained in about 2 days using 8 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 1024, and 16 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file gpt3/400m_improved.yaml for parameter details.

To train a 400M_improved model on a Slurm or Kubernetes cluster, modify the conf/config.yaml file to set:

    - training: gpt3/400m_improved
    stages:
      - training

Then enter:

    python3 main.py

To train a 400M_improved GPT model on a Base Command Platform cluster with 8 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/400m_improved \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
    training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

1B_improved configuration

The 1B_improved model uses the bf16 data type. It can be trained in about 3 days using 8 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, and 16 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file gpt3/1b_improved.yaml for parameter details.

To train a 1B_improved model on a Slurm or Kubernetes cluster, modify the conf/config.yaml file to set:

    - training: gpt3/1b_improved
    stages:
      - training

Then enter:

    python3 main.py

To train a 1B_improved GPT model on a Base Command Platform cluster with 8 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/1b_improved \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
    training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

5B configuration

The 5B model uses the bf16 data type. It can be trained in about 5 days using 16 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 4096, and 32 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 1. For details on the parameters, see the configuration file gpt3/5b.yaml.

To train a 5B GPT model, modify the conf/config.yaml file to set:

    - training: gpt3/5b
    stages:
      - training

Then enter:

    python3 main.py

To train a 5B GPT model on a Base Command Platform cluster with 16 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/5b \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
    training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

7B_improved configuration

The 7B_improved model uses the bf16 data type. It can be trained in about 6 days using 16 nodes with 8 GPUs per node. The model includes 32 transformer layers, a hidden size of 4096, and 32 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 2. For details on the parameters, see the configuration file gpt3/7b_improved.yaml.

To train a 7B_improved GPT model, modify the conf/config.yaml file to set:

    - training: gpt3/7b_improved
    stages:
      - training

Then enter:

    python3 main.py

To train a 7B_improved GPT model on a Base Command Platform cluster with 16 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/7b_improved \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
    training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

20B configuration

The 20B model uses the bf16 data type. It can be trained in about 6 days using 64 nodes with 8 GPUs per node. The model includes 44 transformer layers, a hidden size of 6144, and 48 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 1. For details on the parameters, see the configuration file gpt3/20b.yaml.

To train a 20B GPT model, modify the conf/config.yaml file to set:

    - training: gpt3/20b
    stages:
      - training

Then enter:

    python3 main.py

To train a 20B GPT model on a Base Command Platform cluster with 64 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/20b \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
    training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

40B configuration

The 40B model uses the bf16 data type. It can be trained in about 6 days using 128 nodes with 8 GPUs per node. The model includes 48 transformer layers, a hidden size of 8192, and 64 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 1. For details on the parameters, see the configuration file gpt3/40b.yaml.

To train a 40B GPT model, modify the conf/config.yaml file to set:

    - training: gpt3/40b
    stages:
      - training

Then enter:

    python3 main.py

To train a 40B GPT model on a Base Command Platform cluster with 128 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/40b \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
    training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

40B_improved configuration

The 40B_improved model uses the bf16 data type. It can be trained in about 6 days using 96 nodes with 8 GPUs per node. The model includes 48 transformer layers, a hidden size of 8192, and 64 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 2. For details on the parameters, see the configuration file gpt3/40b_improved.yaml.
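
The 40B and 40B_improved descriptions quote the same training time on different node counts, so their approximate compute budgets can be compared in GPU-days. The arithmetic below uses only the figures stated in this document; `gpu_days` is an illustrative helper name.

```python
def gpu_days(days: float, nodes: int, gpus_per_node: int = 8) -> float:
    """Total GPU-days consumed by a run of the given length and size."""
    return days * nodes * gpus_per_node


baseline = gpu_days(days=6, nodes=128)   # 40B: about 6 days on 128 nodes
improved = gpu_days(days=6, nodes=96)    # 40B_improved: about 6 days on 96 nodes
saving = 1 - improved / baseline

print(f"40B: {baseline:.0f} GPU-days, 40B_improved: {improved:.0f} GPU-days")
print(f"Relative saving: {saving:.0%}")
```

Because the quoted durations are approximate, the result should be read as a ballpark comparison rather than a measured saving.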

To train a 40B_improved GPT model, modify the conf/config.yaml file to set:

    - training: gpt3/40b_improved
    stages:
      - training

Then enter:

    python3 main.py

To train a 40B_improved GPT model on a Base Command Platform cluster with 96 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/40b_improved \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
    training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

175B configuration

The 175B model uses the bf16 data type. It can be trained in about 24 days using 128 nodes with 8 GPUs per node. The model includes 96 transformer layers, a hidden size of 12288, and 96 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 16. It also uses interleaved pipeline scheduling, with a virtual pipeline chunk size of 6. For details on the parameters, see the configuration file gpt3/175b.yaml.
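
To see how the 175B parallelism settings divide the cluster, the following sketch works through the arithmetic for the node count, tensor parallelism, and pipeline parallelism quoted above. It does not model the interleaved schedule, and the variable names are illustrative rather than launcher parameters.

```python
# Figures taken from the 175B description above.
nodes, gpus_per_node = 128, 8
tensor_parallel, pipeline_parallel = 8, 16
num_layers = 96

world_size = nodes * gpus_per_node                      # total GPUs in the job
gpus_per_replica = tensor_parallel * pipeline_parallel  # GPUs holding one model copy
data_parallel = world_size // gpus_per_replica          # number of model copies
layers_per_stage = num_layers // pipeline_parallel      # layers on each pipeline stage

print(f"world size: {world_size}")
print(f"GPUs per model replica: {gpus_per_replica}")
print(f"data-parallel replicas: {data_parallel}")
print(f"layers per pipeline stage: {layers_per_stage}")
```

With these settings one model replica spans 128 GPUs (16 nodes), leaving 8 data-parallel replicas, and each pipeline stage holds 6 of the 96 layers.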

To train a 175B GPT model, modify the conf/config.yaml file to set:

    - training: gpt3/175b
    stages:
      - training

Then enter:

    python3 main.py

To train a 175B GPT model on a Base Command Platform cluster with 128 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/175b \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
    training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

FP8 with Transformer Engine

NVIDIA Transformer Engine (TE) is an open source library for accelerating Transformer-based models on NVIDIA Hopper GPUs. It enables using 8-bit floating point (FP8) precision to provide better performance with lower memory utilization in both training and inference. TE is available on GitHub.

You can now use fp8 data in the NeMo framework to pre-train GPT models. For example, if you want to turn on fp8 to pre-train a GPT3 5B model, you can modify the gpt3/5b training configuration in conf/training/gpt3/5b.yaml as follows:

    ## Transformer Engine
    transformer_engine: True # turn on Transformer Engine
    fp8: True # enables fp8 in TransformerLayer forward
    fp8_e4m3: False # sets fp8_format = recipe.Format.E4M3
    fp8_hybrid: True # sets fp8_format = recipe.Format.HYBRID
    fp8_margin: 0 # scaling margin
    fp8_interval: 1 # scaling update interval
    fp8_amax_history_len: 32 # Number of steps for which amax history is recorded per tensor
    fp8_amax_compute_algo: max # 'most_recent' or 'max'. Algorithm for computing amax from history
    use_emha: False

NVIDIA has compared fp8 and bf16 precision, and has observed similar convergence behavior but significant speed-up with fp8.
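
To give a feel for what fp8_margin, fp8_amax_history_len, and fp8_amax_compute_algo control, here is a minimal sketch of amax-history scaling in plain Python. It is a simplified illustration under stated assumptions, not Transformer Engine's implementation: the `DelayedScaler` class is hypothetical, the update interval is not modeled, and the maximum representable values (448 for E4M3, 57344 for E5M2) are the commonly cited limits of those formats.

```python
from collections import deque

# Largest finite magnitudes of the two FP8 formats (assumed constants).
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


class DelayedScaler:
    """Toy scaling-factor tracker driven by a history of per-tensor amax values."""

    def __init__(self, fp8_format="E4M3", margin=0,
                 amax_history_len=32, amax_compute_algo="max"):
        self.fp8_max = FP8_MAX[fp8_format]
        self.margin = margin
        self.history = deque(maxlen=amax_history_len)  # oldest entries drop out
        self.algo = amax_compute_algo
        self.scale = 1.0

    def update(self, tensor_amax: float) -> float:
        """Record the latest absolute maximum and recompute the scaling factor."""
        self.history.append(tensor_amax)
        amax = max(self.history) if self.algo == "max" else self.history[-1]
        if amax > 0:
            # A larger margin leaves more headroom below the format's maximum.
            self.scale = self.fp8_max / amax / (2 ** self.margin)
        return self.scale


observed = [2.0, 8.0, 4.0]  # hypothetical amax values from three steps

scaler_max = DelayedScaler(amax_compute_algo="max")
scaler_recent = DelayedScaler(amax_compute_algo="most_recent")
for value in observed:
    scaler_max.update(value)
    scaler_recent.update(value)

print(scaler_max.scale, scaler_recent.scale)
```

In this toy run the 'max' algorithm keeps the scale tied to the largest value in the history window, while 'most_recent' follows the latest step only.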

NVIDIA provides configurations for five T5 model sizes: 220M, 3B, 11B, 23B, and 41B. These configurations include carefully selected hyperparameters, which you may use as guidelines for any custom model configurations.

The configurations are defined by configuration files in the directory conf/training/t5. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.

For Base Command Platform, you must launch all jobs in multi-node mode.

220M configuration

The 220M model uses the bf16 data type. It can be trained in about 3.5 days using 4 nodes with 8 GPUs per node. The model includes 12 transformer layers, a hidden size of 768, a feedforward network size of 2048, and 12 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file t5/220m.yaml for parameter details.

To train a 220M model on a Slurm cluster, modify the conf/config.yaml file to set:

    training: t5/220m
    stages:
      - training

Then enter:

    python3 main.py

To train a 220M model on a Base Command Platform cluster with 4 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/220m \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

To train with a different number of nodes, you can change the relevant parameters (e.g. micro_batch_size) either in the appropriate YAML file or from the command line. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, all jobs must be launched in multi-node mode.

3B configuration

The 3B model uses the bf16 data type. It can be trained in about 7.5 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, a feedforward network size of 5120, and 32 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on the parameters, see the configuration file t5/3b.yaml.

To train a 3B model, modify the conf/config.yaml file to set:

    training: t5/3b
    stages:
      - training

Then enter:

    python3 main.py

To train a 3B model on a Base Command Platform cluster with 20 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/3b \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

11B configuration

The 11B model uses the bf16 data type. It can be trained in about 26.5 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 4096, a feedforward network size of 10240, and 64 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4. For details on the parameters, see the configuration file t5/11b.yaml.

To train an 11B model, modify the conf/config.yaml file to set:

    training: t5/11b
    stages:
      - training

Then enter:

    python3 main.py

To train an 11B model on a Base Command Platform cluster with 20 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/11b \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

23B configuration

The 23B model uses the bf16 data type. It can be trained in about 36 days using 40 nodes with 8 GPUs per node. The model includes 36 transformer layers, a hidden size of 5120, a feedforward network size of 10880, and 64 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 2. For details on the parameters, see the configuration file t5/23b.yaml.

To train a 23B model, modify the conf/config.yaml file to set:

    training: t5/23b
    stages:
      - training

Then enter:

    python3 main.py

To train a 23B model on a Base Command Platform cluster with 40 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/23b \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

41B configuration

The 41B model uses the bf16 data type. It can be trained in about 60 days using 40 nodes with 8 GPUs per node. The model includes 36 transformer layers, a hidden size of 6144, a feedforward network size of 10880, and 96 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 2. For details on the parameters, see the configuration file t5/41b.yaml.

To train a 41B model, modify the conf/config.yaml file to set:

    training: t5/41b
    stages:
      - training

Then enter:

    python3 main.py

To train a 41B model on a Base Command Platform cluster with 40 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/41b \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

NVIDIA provides configurations for three mT5 model sizes: 170M, 390M, and 3B. These configurations include carefully selected hyperparameters, which you may use as guidelines for any custom model configurations.

The configurations are defined by configuration files in the directory conf/training/mt5. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.

For Base Command Platform, you must launch all jobs in multi-node mode.

170M configuration

The 170M model uses the bf16 data type. It can be trained in about 4 days using 4 nodes with 8 GPUs per node. The model includes 8 transformer layers, a hidden size of 512, a feedforward network size of 1024, and 6 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file mt5/170m.yaml for parameter details.

To train a 170M model on a Slurm cluster, modify the conf/config.yaml file to set:

    training: mt5/170m
    stages:
      - training

Then enter:

    python3 main.py

To train a 170M model on a Base Command Platform cluster with 4 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=mt5/170m \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
    training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

To train with a different number of nodes, you can change the relevant parameters (e.g. micro_batch_size) either in the appropriate YAML configuration file or from the command line. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, you must launch all jobs in multi-node mode.

390M configuration

The 390M model uses the bf16 data type. It can be trained in about 4 days using 8 nodes with 8 GPUs per node. The model includes 8 transformer layers, a hidden size of 512, a feedforward network size of 2048, and 12 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file mt5/390m.yaml for parameter details.

To train a 390M model on a Slurm cluster, modify the conf/config.yaml file to set:

    training: mt5/390m
    stages:
      - training

Then enter:

    python3 main.py

To train a 390M model on a Base Command Platform cluster with 8 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=mt5/390m \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
    training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

3B configuration

The 3B model uses the bf16 data type. It can be trained in about 14 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, a feedforward network size of 5120, and 32 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 2. For details on the parameters, see the configuration file mt5/3b.yaml.

To train a 3B model, modify the conf/config.yaml file to set:

    training: mt5/3b
    stages:
      - training

Then enter:

    python3 main.py

To train a 3B model on a Base Command Platform cluster with 20 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=mt5/3b \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
    training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

The training code can log the model- and system-related metrics to both TensorBoard and Weights & Biases (W&B). The local files are stored in the directory specified in the training.exp_manager.explicit_log_dir parameter. TensorBoard logs are saved by default.

However, the W&B API key must be specified for W&B to work properly. To upload the logs to W&B, you must first store the API key in the first (normally the only) line of a text file and set the wandb_api_key_file parameter to the file’s pathname. For Base Command Platform, you can store this file in a dataset or workspace mounted for the job.
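
The key file format described above is simply a text file whose first line holds the key. The sketch below writes such a file to a temporary directory and reads it back. `read_api_key` is a hypothetical helper for illustration, not part of the launcher, and the key shown is a placeholder.

```python
import tempfile
from pathlib import Path


def read_api_key(key_file: str) -> str:
    """Return the first line of the key file with surrounding whitespace removed."""
    with open(key_file, encoding="utf-8") as handle:
        key = handle.readline().strip()
    if not key:
        raise ValueError(f"No API key found on the first line of {key_file}")
    return key


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder key; a real key would come from your W&B account settings.
    key_path = Path(tmp) / "wandb_api_key.txt"
    key_path.write_text("0123456789abcdef-placeholder\n", encoding="utf-8")
    loaded_key = read_api_key(str(key_path))

print(f"Loaded a key of length {len(loaded_key)}")
```

A quick check like this before launching can catch an empty file or a stray blank first line.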

You must set the following training configurations to enable logging of training metrics to W&B:

    exp_manager:
      create_wandb_logger: True
      wandb_logger_kwargs:
        project: [W&B project name]
        name: [W&B run name]

The logs show reduced_train_loss, val_loss, train_step_timing, and other relevant metrics. train_step_timing measures the time it takes to finish each global step.
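
Because train_step_timing covers one global step, it can be turned into a rough throughput figure. The sketch below assumes the metric is reported in seconds and uses a hypothetical global batch size. Only the sequence length of 2048 comes from the GPT descriptions in this document, and `tokens_per_second` is an illustrative name.

```python
def tokens_per_second(global_batch_size: int, seq_length: int,
                      train_step_timing: float) -> float:
    """Tokens processed per second, given the time for one global step in seconds."""
    return global_batch_size * seq_length / train_step_timing


# Hypothetical batch size and step time, chosen only for illustration.
rate = tokens_per_second(global_batch_size=256, seq_length=2048, train_step_timing=2.0)
print(f"{rate:,.0f} tokens/s")
```

Tracking this number across runs makes it easier to compare configurations that use different batch sizes or node counts.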

NVIDIA provides configurations for four BERT model sizes: 110M, 4B, 20B, and 100B. These configurations include carefully selected hyperparameters, which you may use as guidelines for any custom model configurations.

The configurations are defined by configuration files in the directory conf/training/bert. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.

For Base Command Platform, you must launch all jobs in multi-node mode.

110M configuration

The 110M model uses the bf16 data type. The model includes 12 transformer layers, a hidden size of 768, a feedforward network size of 3072, and 12 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file bert/110m.yaml for parameter details.

To train a 110M model on a Slurm cluster, modify the conf/config.yaml file to set:

    training: bert/110m
    stages:
      - training

Then enter:

    python3 main.py

To train a 110M model on a Base Command Platform cluster with 4 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=bert/110m \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

To train with a different number of nodes, the relevant parameters (e.g. micro_batch_size) can be adjusted either in the appropriate yaml config file or from the command line. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, all jobs must be launched in multi-node mode.

4B configuration

The 4B model uses the bf16 data type. The model includes 48 transformer layers, a hidden size of 2560, a feedforward network size of 10240, and 40 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For the details on all the parameters, see the bert/4b.yaml config file.

To train a 4B model, modify the conf/config.yaml file to set:

    training: bert/4b
    stages:
      - training

Then enter:

    python3 main.py

To train a 4B model on a Base Command Platform cluster with 20 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=bert/4b \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

20B configuration

The 20B model uses the bf16 data type. The model includes 48 transformer layers, a hidden size of 6144, a feedforward network size of 24576, and 96 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For the details on all the parameters, see the bert/20b.yaml config file.

To train a 20B model, modify the conf/config.yaml file to set:

    training: bert/20b
    stages:
      - training

Then enter:

    python3 main.py

To train a 20B model on a Base Command Platform cluster with 20 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=bert/20b \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

100B configuration

The 100B model uses the bf16 data type. The model includes 96 transformer layers, a hidden size of 9216, a feedforward network size of 36864, and 96 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on the parameters, see the configuration file bert/100b.yaml.

To train a 100B model, modify the conf/config.yaml file to set:

    training: bert/100b
    stages:
      - training

Then enter:

    python3 main.py

To train a 100B model on a Base Command Platform cluster with 20 nodes, enter:

    python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=bert/100b \
    stages=[training] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
    base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
    training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

© Copyright 2023, NVIDIA. Last updated on Nov 14, 2023.