Training with Predefined Configurations

NVIDIA provides configuration for five T5 model sizes: 220M, 3B, 11B, 23B, and 41B. These configurations include carefully selected hyperparameters, which you may use as guidelines for any custom model configurations.

The configurations are defined by configuration files in the directory conf/training/t5. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.

For Base Command Platform, you must launch all jobs in multi-node mode.

220M configuration

The 220M model uses the bf16 data type. It can be trained in about 3.5 days using 4 nodes with 8 GPUs per node. The model includes 12 transformer layers, a hidden size of 768, a feedforward network size of 2048, and 12 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file t5/220m.yaml for parameter details.

To train a 220M model on a Slurm cluster, modify the conf/config.yaml file to set:

training: t5/220m
stages:
  - training

Then enter:

python3 main.py

To train a 220M model on a Base Command Platform cluster with 4 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=t5/220m \
stages=<training> \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

To train with a different number of nodes, you can change the relevant parameters (e.g. micro_batch_size) either in the appropriate YAML file or from the command line. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, all jobs must be launched in multi-node mode.

3B configuration

The 3B model uses the bf16 data type. It can be trained in about 7.5 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, a feedforward network size of 5120, and 32 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on the parameters, see the configuration file t5/3b.yaml.

To train a 3B model, modify the conf/config.yaml file to set:

training: t5/3b
stages:
  - training

Then enter:

python3 main.py

To train a 3B model on a Base Command Platform cluster with 20 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=t5/3b \
stages=<training> \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

11B configuration

The 11B model uses the bf16 data type. It can be trained in about 26.5 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 4096, a feedforward network size of 10240, and 64 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4. For details on the parameters, see the configuration file t5/11b.yaml.

To train a 11B model, modify the conf/config.yaml file to set:

training: t5/11b
stages:
  - training

Then enter:

python3 main.py

To train a 11B model on Base Command Platform cluster on 20 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=t5/11b \
stages=<training> \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

23B configuration

The 23B model uses the bf16 data type. It can be trained in about 36 days using 40 nodes with 8 GPUs per node. The model includes 36 transformer layers, a hidden size of 5120, a feedforward network size of 10880, and 64 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 2. For details on the parameters, see the configuration file t5/23b.yaml.

To train a 23B model, modify the conf/config.yaml file to set:

training: t5/23b
stages:
  - training

Then enter:

python3 main.py

To train a 23B model on Base Command Platform cluster on 40 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=t5/23b \
stages=<training> \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

41B configuration

The 41B model uses the bf16 data type. It can be trained in about 60 days using 40 nodes with 8 GPUs per node. The model includes 36 transformer layers, a hidden size of 6144, a feedforward network size of 10880, and 96 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 2. For details on the parameters, see the configuration file t5/23b.yaml.

To train a 41B model, modify the conf/config.yaml file to set:

training: t5/41b
stages:
  - training

Then enter:

python3 main.py

To train a 41B model on Base Command Platform cluster on 40 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=t5/41b \
stages=<training> \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).