Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Training with Predefined Configurations
NVIDIA provides configurations for five T5 model sizes: 220M, 3B, 11B, 23B, and 41B. These configurations include carefully selected hyperparameters, which you can use as guidelines for any custom model configuration.
The configurations are defined by configuration files in the directory conf/training/t5. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.
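As a rough illustration of how a training name relates to a file on disk, the sketch below maps a value such as t5/220m to its YAML path under conf/training. The helper name training_config_path and the default conf directory argument are hypothetical and are not part of the launcher, which resolves these names through its own configuration loading.

```python
from pathlib import PurePosixPath

# Model sizes listed in this section.
T5_SIZES = ("220m", "3b", "11b", "23b", "41b")


def training_config_path(training: str, conf_dir: str = "conf") -> PurePosixPath:
    """Map a training name such as 't5/220m' to its YAML file path.

    Hypothetical helper for illustration; not part of the launcher.
    """
    family, _, size = training.partition("/")
    if family != "t5" or size not in T5_SIZES:
        raise ValueError(f"unknown T5 training configuration: {training!r}")
    return PurePosixPath(conf_dir) / "training" / family / f"{size}.yaml"


print(training_config_path("t5/220m"))  # conf/training/t5/220m.yaml
```

Rejecting unknown names early makes a typo in the training value fail before any job is submitted.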
For Base Command Platform, you must launch all jobs in multi-node mode.
220M configuration
The 220M model uses the bf16 data type. It can be trained in about 3.5 days using 4 nodes with 8 GPUs per node. The model includes 12 transformer layers, a hidden size of 768, a feedforward network size of 2048, and 12 attention heads, and it uses the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file t5/220m.yaml for parameter details.
To train a 220M model on a Slurm cluster, modify the conf/config.yaml file to set:
training: t5/220m
stages:
- training
Then enter:
python3 main.py
To train a 220M model on a Base Command Platform cluster with 4 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=t5/220m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
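Because the Base Command Platform command differs between model sizes only in the training value, it can be convenient to generate it. The sketch below assembles the same overrides as the command above into an argument list. The function name bcp_training_command is hypothetical, and the stages value is written in list form ([training]), which is an assumption about how the launcher parses that override.

```python
def bcp_training_command(
    size: str,
    launcher_dir: str = "/opt/NeMo-Framework-Launcher/launcher_scripts",
    data_dir: str = "/mount/data/the_pile_t5",
    results_dir: str = "/mount/results",
    vocab_file: str = "/mount/data/bpe/vocab.txt",
) -> list[str]:
    """Return the argument list for a T5 training launch on Base Command Platform.

    Hypothetical helper; the overrides mirror the command shown in this section.
    """
    return [
        "python3", f"{launcher_dir}/main.py",
        f"training=t5/{size}",
        "stages=[training]",  # assumed list syntax for the stages override
        f"launcher_scripts_path={launcher_dir}",
        f"data_dir={data_dir}",
        f"base_results_dir={results_dir}",
        # Kept as a literal string; the platform sets NGC_ARRAY_SIZE
        # to the number of nodes selected for the job.
        "training.trainer.num_nodes=$NGC_ARRAY_SIZE",
        f"training.model.tokenizer.vocab_file={vocab_file}",
        "cluster_type=bcp",
    ]


args = bcp_training_command("220m")
print(" \\\n".join(args))
```

The default paths match the mount points assumed above; pass different values if your workspaces are mounted elsewhere.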
To train with a different number of nodes, change the relevant parameters (for example, micro_batch_size) either in the appropriate YAML file or from the command line. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, all jobs must be launched in multi-node mode.
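One reason micro_batch_size matters when the node count changes: in Megatron-style training, the global batch size is the product of the micro batch size, the number of data-parallel replicas, and the number of gradient accumulation steps. The sketch below checks that relationship for a given node count. The function name and the batch sizes in the example are hypothetical values chosen for illustration; they are not taken from the T5 configuration files.

```python
def grad_accumulation_steps(
    global_batch_size: int,
    micro_batch_size: int,
    num_nodes: int,
    gpus_per_node: int = 8,
    tensor_parallel: int = 1,
    pipeline_parallel: int = 1,
) -> int:
    """Return the accumulation steps implied by a batch-size and node setup.

    Raises ValueError when the sizes do not divide evenly, which signals that
    micro_batch_size (or another parameter) has to change.
    """
    total_gpus = num_nodes * gpus_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if total_gpus % model_parallel:
        raise ValueError("GPU count is not divisible by the model-parallel size")
    data_parallel = total_gpus // model_parallel
    per_step = micro_batch_size * data_parallel
    if global_batch_size % per_step:
        raise ValueError(
            f"global batch {global_batch_size} is not a multiple of "
            f"micro batch x data-parallel size ({per_step})"
        )
    return global_batch_size // per_step


# Hypothetical batch sizes: 4 nodes versus 2 nodes, no model parallelism.
print(grad_accumulation_steps(2048, 64, num_nodes=4))  # 1
print(grad_accumulation_steps(2048, 64, num_nodes=2))  # 2
```

Halving the node count here doubles the accumulation steps, while a node count that does not divide evenly (for example 3 nodes with these values) raises an error and calls for a different micro batch size.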
3B configuration
The 3B model uses the bf16 data type. It can be trained in about 7.5 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, a feedforward network size of 5120, and 32 attention heads, and it uses the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on the parameters, see the configuration file t5/3b.yaml.
To train a 3B model, modify the conf/config.yaml file to set:
training: t5/3b
stages:
- training
Then enter:
python3 main.py
To train a 3B model on a Base Command Platform cluster with 20 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=t5/3b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
11B configuration
The 11B model uses the bf16 data type. It can be trained in about 26.5 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 4096, a feedforward network size of 10240, and 64 attention heads, and it uses the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4. For details on the parameters, see the configuration file t5/11b.yaml.
To train an 11B model, modify the conf/config.yaml file to set:
training: t5/11b
stages:
- training
Then enter:
python3 main.py
To train an 11B model on a Base Command Platform cluster with 20 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=t5/11b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
23B configuration
The 23B model uses the bf16 data type. It can be trained in about 36 days using 40 nodes with 8 GPUs per node. The model includes 36 transformer layers, a hidden size of 5120, a feedforward network size of 10880, and 64 attention heads, and it uses the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 2. For details on the parameters, see the configuration file t5/23b.yaml.
To train a 23B model, modify the conf/config.yaml file to set:
training: t5/23b
stages:
- training
Then enter:
python3 main.py
To train a 23B model on a Base Command Platform cluster with 40 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=t5/23b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
41B configuration
The 41B model uses the bf16 data type. It can be trained in about 60 days using 40 nodes with 8 GPUs per node. The model includes 36 transformer layers, a hidden size of 6144, a feedforward network size of 10880, and 96 attention heads, and it uses the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 2. For details on the parameters, see the configuration file t5/41b.yaml.
To train a 41B model, modify the conf/config.yaml file to set:
training: t5/41b
stages:
- training
Then enter:
python3 main.py
To train a 41B model on a Base Command Platform cluster with 40 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=t5/41b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
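To compare the five configurations at a glance, the sketch below collects the node counts, approximate training times, and parallelism settings quoted in this section, then derives the total GPU count, approximate GPU-days, and number of data-parallel model replicas. The class and field names are hypothetical. Two assumptions are made: the 11B entry uses a pipeline parallelism of 1 because only tensor parallelism is stated for it, and the 3B entry leaves parallelism unset because this section does not state it.

```python
from dataclasses import dataclass
from typing import Optional

GPUS_PER_NODE = 8


@dataclass(frozen=True)
class T5Config:
    name: str
    nodes: int
    days: float                       # approximate training time from this section
    tensor_parallel: Optional[int]    # None where this section does not state it
    pipeline_parallel: Optional[int]  # None where this section does not state it

    @property
    def gpus(self) -> int:
        return self.nodes * GPUS_PER_NODE

    @property
    def gpu_days(self) -> float:
        return self.gpus * self.days

    @property
    def model_replicas(self) -> Optional[int]:
        """Data-parallel copies of the model, when the layout is stated."""
        if self.tensor_parallel is None or self.pipeline_parallel is None:
            return None
        return self.gpus // (self.tensor_parallel * self.pipeline_parallel)


CONFIGS = [
    T5Config("220m", 4, 3.5, 1, 1),       # no model parallelism
    T5Config("3b", 20, 7.5, None, None),  # parallelism not stated here
    T5Config("11b", 20, 26.5, 4, 1),      # pipeline parallelism of 1 is assumed
    T5Config("23b", 40, 36, 4, 2),
    T5Config("41b", 40, 60, 4, 2),
]

for cfg in CONFIGS:
    print(f"{cfg.name:>5}: {cfg.gpus:3d} GPUs, {cfg.gpu_days:7.0f} GPU-days, "
          f"replicas={cfg.model_replicas}")
```

The GPU-day figures are simple products of the quoted GPU counts and training times, so they inherit the "about" in those estimates.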