Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Training with Predefined Configurations
NVIDIA provides configurations for four BERT model sizes: 110M, 4B, 20B, and 100B. These configurations include carefully selected hyperparameters, which you can use as guidelines for any custom model configurations.
The configurations are defined by configuration files in the directory conf/training/bert. You can choose a configuration by selecting the training configuration in the conf/config.yaml file.
For Base Command Platform, you must launch all jobs in multi-node mode.
110M configuration
The 110M model uses the bf16 data type. The model includes 12 transformer layers, a hidden size of 768, a feedforward network size of 3072, and 12 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the configuration file bert/110m.yaml for parameter details.
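For orientation, the hyperparameters above map onto the training config roughly as follows. This is a minimal sketch, not the shipped file: the key names follow the usual NeMo Megatron config layout and are assumptions here, so treat conf/training/bert/110m.yaml as the authoritative source.

# Illustrative excerpt only -- key names assume the standard NeMo Megatron
# layout; consult conf/training/bert/110m.yaml for the actual file.
trainer:
  precision: bf16                     # training data type
model:
  num_layers: 12                      # transformer layers
  hidden_size: 768
  ffn_hidden_size: 3072               # feedforward network size
  num_attention_heads: 12
  encoder_seq_length: 512             # sequence length
  tensor_model_parallel_size: 1       # no model parallelism for 110M
  pipeline_model_parallel_size: 1
  optim:
    name: distributed_fused_adam      # Distributed Adam optimizer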
To train a 110M model on a Slurm cluster, modify the conf/config.yaml file to set:
training: bert/110m
stages:
- training
Then enter:
python3 main.py
To train a 110M model on a Base Command Platform cluster using 4 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=bert/110m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
To train with a different number of nodes, adjust the relevant parameters (e.g., micro_batch_size) either in the appropriate yaml config file or from the command line, as sketched below. See Resuming Training with a Different Number of Nodes for more information. For Base Command Platform, all jobs must be launched in multi-node mode.
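As an illustration of the command-line route, the override below scales a run of the 110M model to 8 nodes and adjusts the batch sizes. The values are placeholders rather than recommended settings, and global_batch_size must stay consistent with micro_batch_size times the data-parallel size.

# Illustrative overrides only -- pick values appropriate for your cluster.
python3 main.py training=bert/110m stages=[training] \
training.trainer.num_nodes=8 \
training.model.micro_batch_size=2 \
training.model.global_batch_size=256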
4B configuration
The 4B model uses the bf16 data type. The model includes 48 transformer layers, a hidden size of 2560, a feedforward network size of 10240, and 40 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on all the parameters, see the bert/4b.yaml config file.
To train a 4B model, modify the conf/config.yaml file to set:
training: bert/4b
stages:
- training
Then enter:
python3 main.py
To train a 4B model on a Base Command Platform cluster using 20 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=bert/4b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
20B configuration
The 20B model uses the bf16 data type. The model includes 48 transformer layers, a hidden size of 6144, a feedforward network size of 24576, and 96 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on all the parameters, see the bert/20b.yaml config file.
To train a 20B model, modify the conf/config.yaml file to set:
training: bert/20b
stages:
- training
Then enter:
python3 main.py
To train a 20B model on a Base Command Platform cluster using 20 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=bert/20b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).
100B configuration
The 100B model uses the bf16 data type. The model includes 96 transformer layers, a hidden size of 9216, a feedforward network size of 36864, and 96 attention heads with the GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For details on the parameters, see the configuration file bert/100b.yaml.
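Unlike the 110M configuration, a model of this size cannot fit on a single GPU, so bert/100b.yaml is expected to set model-parallel sizes as well. The snippet below is purely illustrative: the key names follow the standard NeMo Megatron layout, but the values are placeholder assumptions rather than the shipped defaults, so check the config file itself.

# Illustrative only -- the parallelism values below are placeholders;
# see conf/training/bert/100b.yaml for the actual settings.
model:
  tensor_model_parallel_size: 8       # splits each layer across GPUs
  pipeline_model_parallel_size: 4     # splits layers into pipeline stages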
To train a 100B model, modify the conf/config.yaml file to set:
training: bert/100b
stages:
- training
Then enter:
python3 main.py
To train a 100B model on a Base Command Platform cluster using 20 nodes, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py training=bert/100b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. $NGC_ARRAY_SIZE is automatically set to the number of nodes selected when creating the job (the number of replicas).