Training with Predefined Configurations

NVIDIA provides configurations for four BERT model sizes: 110M, 4B, 20B, and 100B. These configurations include carefully selected hyperparameters that you can use as guidelines for your own custom model configurations.

The configurations are defined by configuration files in the directory conf/training/bert. You choose a configuration by setting the training field in the conf/config.yaml file.

For Base Command Platform, you must launch all jobs in multi-node mode.

110M configuration

The 110M model uses the bf16 data type. The model includes 12 transformer layers, a hidden size of 768, a feedforward network size of 3072, and 12 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. For details on all of the parameters, see the bert/110m.yaml config file.
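
As a rough guide, the hyperparameters listed above correspond to fields in the model configuration along the lines of the abridged sketch below. The field names follow the usual NeMo Megatron BERT config layout, but this sketch is illustrative only; consult bert/110m.yaml for the authoritative values.

# Abridged, illustrative sketch -- not the full contents of bert/110m.yaml
trainer:
  precision: bf16
model:
  num_layers: 12
  hidden_size: 768
  ffn_hidden_size: 3072
  num_attention_heads: 12
  seq_length: 512
  activation: geglu
  tensor_model_parallel_size: 1  # no model parallelism
  pipeline_model_parallel_size: 1
  optim:
    name: distributed_fused_adam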

To train a 110M model on a Slurm cluster, modify the conf/config.yaml file to set:

training: bert/110m
stages:
  - training

Then enter:

python3 main.py

To train a 110M model on a Base Command Platform cluster using 4 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=bert/110m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

To train with a different number of nodes, you can adjust the relevant parameters (for example, micro_batch_size) either in the appropriate yaml config file or from the command line, as shown in the sketch below. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, all jobs must be launched in multi-node mode.
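
For example, a Slurm-cluster run that changes the node count and batch sizes from the command line might look like the following sketch. The values are placeholders, not recommended settings; in particular, the global batch size must stay consistent with the micro batch size, the number of GPUs, and any parallelism or gradient-accumulation settings.

# Illustrative values only -- choose settings appropriate for your cluster
python3 main.py training=bert/110m \
stages=[training] \
training.trainer.num_nodes=8 \
training.model.micro_batch_size=2 \
training.model.global_batch_size=256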

4B configuration

The 4B model uses the bf16 data type. The model includes 48 transformer layers, a hidden size of 2560, a feedforward network size of 10240, and 40 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on all of the parameters, see the bert/4b.yaml config file.

To train a 4B model, modify the conf/config.yaml file to set:

training: bert/4b
stages:
  - training

Then enter:

python3 main.py

To train a 4B model on a Base Command Platform cluster using 20 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=bert/4b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

20B configuration

The 20B model uses the bf16 data type. The model includes 48 transformer layers, a hidden size of 6144, a feedforward network size of 24576, and 96 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on all of the parameters, see the bert/20b.yaml config file.

To train a 20B model, modify the conf/config.yaml file to set:

training: bert/20b
stages:
  - training

Then enter:

python3 main.py

To train a 20B model on a Base Command Platform cluster using 20 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=bert/20b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).

100B configuration

The 100B model uses the bf16 data type. The model includes 96 transformer layers, a hidden size of 9216, a feedforward network size of 36864, and 96 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on all of the parameters, see the bert/100b.yaml config file.
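
A model of this size cannot fit on a single GPU, so the predefined configuration also sets model-parallelism values. If you need to change them, the overrides follow the same pattern as any other parameter; the sizes in the sketch below are placeholders rather than the values shipped in bert/100b.yaml.

# Placeholder parallelism sizes -- check bert/100b.yaml for the shipped values
python3 main.py training=bert/100b \
stages=[training] \
training.model.tensor_model_parallel_size=8 \
training.model.pipeline_model_parallel_size=8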

To train a 100B model, modify the conf/config.yaml file to set:

training: bert/100b
stages:
  - training

Then enter:

python3 main.py

To train a 100B model on a Base Command Platform cluster using 20 nodes, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=bert/100b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).