Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

Training with Predefined Configurations

NVIDIA provides configurations for three mT5 model sizes: 170M, 390M, and 3B. These configurations include carefully selected hyperparameters, which you may use as guidelines for any custom model configurations.

The configurations are defined by configuration files in the directory conf/training/mt5. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.
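
As a rough illustration of how a training value such as mt5/170m relates to a file under conf/training, the sketch below maps each name to a path. The helper name `config_path` is hypothetical, and the rule of appending `.yaml` to the name is an assumption based on the file names mentioned in this section (for example, mt5/170m.yaml); the launcher performs its own resolution when it loads conf/config.yaml.

```python
from pathlib import PurePosixPath


def config_path(training: str, conf_dir: str = "conf") -> PurePosixPath:
    """Hypothetical helper: map a `training` value to the YAML file it names."""
    return PurePosixPath(conf_dir) / "training" / f"{training}.yaml"


# The three predefined mT5 sizes described in this section.
for name in ("mt5/170m", "mt5/390m", "mt5/3b"):
    print(f"training: {name}  ->  {config_path(name)}")
```

Checking that the resolved file exists before launching a job is a cheap way to catch a mistyped training value.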

For Base Command Platform, you must launch all jobs in multi-node mode.

170M configuration

The 170M model uses the bf16 data type. It can be trained in about 4 days using 4 nodes with 8 GPUs per node. The model includes 8 transformer layers, a hidden size of 512, a feedforward network size of 1024, and 6 attention heads, and it uses the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file mt5/170m.yaml for parameter details.

To train a 170M model on a Slurm cluster, modify the conf/config.yaml file to set:

training: mt5/170m
stages:
  - training

Then enter:

python3 main.py

To train a 170M model on a Base Command Platform cluster with 4 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=mt5/170m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

To train with a different number of nodes, you can change the relevant parameters (e.g. micro_batch_size) either in the appropriate YAML configuration file or from the command line. See Resuming Training with a Different Number of Nodes for more information.
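
The sketch below illustrates why micro_batch_size may need to change with the node count. It assumes the Megatron-style relationship in which the global batch size equals micro_batch_size times the data-parallel size times the number of gradient accumulation steps. The function name `grad_accum_steps` and the global batch size of 2048 are hypothetical values chosen for illustration; take the real values from the YAML configuration file.

```python
def grad_accum_steps(global_batch_size: int, micro_batch_size: int,
                     num_nodes: int, gpus_per_node: int = 8,
                     tensor_parallel: int = 1,
                     pipeline_parallel: int = 1) -> int:
    """Return the accumulation steps implied by a node count (illustrative)."""
    world_size = num_nodes * gpus_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("GPU count must be divisible by the model-parallel size")
    data_parallel = world_size // model_parallel
    samples_per_step = micro_batch_size * data_parallel
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global batch {global_batch_size} is not a multiple of "
            f"micro_batch_size * data_parallel = {samples_per_step}"
        )
    return global_batch_size // samples_per_step


# Hypothetical global batch size of 2048 with micro_batch_size=64.
for nodes in (4, 2, 3):
    try:
        print(nodes, "nodes ->", grad_accum_steps(2048, 64, nodes), "step(s)")
    except ValueError as err:
        print(nodes, "nodes -> invalid:", err)
```

Under these assumed numbers, 4 and 2 nodes divide the global batch evenly, while 3 nodes do not, so micro_batch_size (or the global batch size) would have to be adjusted.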

390M configuration

The 390M model uses the bf16 data type. It can be trained in about 4 days using 8 nodes with 8 GPUs per node. The model includes 8 transformer layers, a hidden size of 512, a feedforward network size of 2048, and 12 attention heads, and it uses the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file mt5/390m.yaml for parameter details.

To train a 390M model on a Slurm cluster, modify the conf/config.yaml file to set:

training: mt5/390m
stages:
  - training

Then enter:

python3 main.py

To train a 390M model on a Base Command Platform cluster with 8 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=mt5/390m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

3B configuration

The 3B model uses the bf16 data type. It can be trained in about 14 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, a feedforward network size of 5120, and 32 attention heads, and it uses the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses a tensor parallelism of 2. See the configuration file mt5/3b.yaml for parameter details.
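
To put the node counts and training times quoted for the three configurations side by side, the sketch below derives the total GPU count, the number of data-parallel replicas, and an approximate GPU-day budget for each. The dictionary layout and the `summarize` helper are hypothetical; the input numbers come from the descriptions in this section, and the GPU-day figures are simple products of the approximate training times, not measured results.

```python
GPUS_PER_NODE = 8  # every configuration in this section uses 8 GPUs per node

# Node count, approximate training days, and tensor parallelism per model size.
CONFIGS = {
    "170m": {"nodes": 4, "days": 4, "tensor_parallel": 1},
    "390m": {"nodes": 8, "days": 4, "tensor_parallel": 1},
    "3b": {"nodes": 20, "days": 14, "tensor_parallel": 2},
}


def summarize(name: str) -> dict:
    """Hypothetical helper: derive resource totals for one configuration."""
    cfg = CONFIGS[name]
    gpus = cfg["nodes"] * GPUS_PER_NODE
    return {
        "gpus": gpus,
        # Each model replica spans `tensor_parallel` GPUs.
        "data_parallel": gpus // cfg["tensor_parallel"],
        "gpu_days": gpus * cfg["days"],
    }


for name in CONFIGS:
    print(name, summarize(name))
```

With tensor parallelism of 2, the 160 GPUs of the 3B configuration hold 80 model replicas, whereas the two smaller models keep one replica per GPU.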

To train a 3B model on a Slurm cluster, modify the conf/config.yaml file to set:

training: mt5/3b
stages:
  - training

Then enter:

python3 main.py

To train a 3B model on a Base Command Platform cluster with 20 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=mt5/3b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).