Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Training with Predefined Configurations
NVIDIA provides configurations for three mT5 model sizes: 170M, 390M, and 3B. These configurations include carefully selected hyperparameters, which you can use as guidelines for any custom model configurations.
The configurations are defined by configuration files in the directory conf/training/mt5. You choose a configuration by selecting the training configuration in the conf/config.yaml file.
For Base Command Platform, you must launch all jobs in multi-node mode.
170M configuration
The 170M model uses the bf16 data type. It can be trained in about 4 days using 4 nodes with 8 GPUs per node. The model includes 8 transformer layers, a hidden size of 512, a feedforward network size of 1024, and 6 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file mt5/170m.yaml for parameter details.
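As a rough orientation, the hyperparameters above correspond to settings of the following kind in the training configuration. This is a minimal sketch with illustrative key names and only the values stated above, not a copy of the shipped file; see conf/training/mt5/170m.yaml for the authoritative settings.
model:
  seq_length: 512                  # sequence length
  num_layers: 8                    # transformer layers
  hidden_size: 512
  ffn_hidden_size: 1024            # feedforward network size
  num_attention_heads: 6
  activation: geglu                # GeGLU activation function
  tensor_model_parallel_size: 1    # no model parallelism
  optim:
    name: distributed_fused_adam   # Distributed Adam optimizer
trainer:
  num_nodes: 4
  devices: 8                       # GPUs per node
  precision: bf16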
To train a 170M model on a Slurm cluster, modify the conf/config.yaml file to set:
training: mt5/170m
stages:
- training
Then enter:
python3 main.py
To train a 170M model on a Base Command Platform cluster using 4 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=mt5/170m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
To train with a different number of nodes, you can change the relevant parameters (for example, micro_batch_size) either in the appropriate YAML configuration file or from the command line, as in the sketch below. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, you must launch all jobs in multi-node mode.
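For example, a Slurm run on 2 nodes with an adjusted micro batch size could use Hydra overrides like the following. The micro_batch_size path and value here are illustrative assumptions; check the model section of the chosen training configuration for the actual parameter and pick a value that fits your GPU memory.
python3 main.py training=mt5/170m stages=[training] \
training.trainer.num_nodes=2 \
training.model.micro_batch_size=32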
390M configuration
The 390M model uses the bf16 data type. It can be trained in about 4 days using 8 nodes with 8 GPUs per node. The model includes 8 transformer layers, a hidden size of 512, a feedforward network size of 2048, and 12 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file mt5/390m.yaml for parameter details.
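Relative to the 170M sketch above, the description implies only a few fields change; again the key names are illustrative, and conf/training/mt5/390m.yaml is the authoritative source.
model:
  ffn_hidden_size: 2048       # larger feedforward network
  num_attention_heads: 12     # more attention heads
trainer:
  num_nodes: 8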
To train a 390M model on a Slurm cluster, modify the conf/config.yaml file to set:
training: mt5/390m
stages:
- training
Then enter:
python3 main.py
To train a 390M model on a Base Command Platform cluster using 8 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=mt5/390m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
3B configuration
The 3B model uses the bf16 data type. It can be trained in about 14 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, a feedforward network size of 5120, and 32 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses a tensor model parallel size of 2. For details on the parameters, see the configuration file mt5/3b.yaml.
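The main structural difference from the smaller configurations is the use of tensor model parallelism, which splits each layer's weight matrices across GPUs. In sketch form, with illustrative key names (see conf/training/mt5/3b.yaml for the real values):
model:
  num_layers: 24
  hidden_size: 2048
  ffn_hidden_size: 5120
  num_attention_heads: 32
  tensor_model_parallel_size: 2   # each layer is split across 2 GPUs
trainer:
  num_nodes: 20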
To train a 3B model, modify the conf/config.yaml file to set:
training: mt5/3b
stages:
- training
Then enter:
python3 main.py
To train a 3B model on a Base Command Platform cluster using 20 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=mt5/3b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).