Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Resuming Training with a Different Number of Nodes
If you need to resume a training run with a different number of nodes, NVIDIA recommends that you keep the global batch size (GBS) unchanged. This ensures that each training step remains almost identical, regardless of the number of nodes. The number of nodes you select must be compatible with the other parameters: GBS must be a multiple of
(micro_batch_size × num_gpus) / (tensor_parallelism × pipeline_parallelism)
For example, consider a case where:
- GBS is 1440, its default value for the 5B GPT model
- MBS is 2
- The number of GPUs is 20 × 8 = 160
- The tensor_parallelism configuration is set to 2
- The pipeline_parallelism configuration is set to 1
GBS must be set to a multiple of (2 × 160) / (2 × 1) = 160. Since 1440 = 9 × 160, the value of GBS is valid.
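The following is a minimal sketch of this divisibility check in Python. The function name validate_gbs and its parameters are illustrative, not part of the NeMo API; it simply verifies the constraint described above using the values from the example.

```python
def validate_gbs(gbs: int, micro_batch_size: int, num_gpus: int,
                 tensor_parallelism: int, pipeline_parallelism: int) -> bool:
    """Return True if GBS is a multiple of
    (micro_batch_size * num_gpus) / (tensor_parallelism * pipeline_parallelism)."""
    divisor = (micro_batch_size * num_gpus) // (tensor_parallelism * pipeline_parallelism)
    return gbs % divisor == 0

# Values from the example: 20 nodes x 8 GPUs = 160 GPUs, MBS = 2, TP = 2, PP = 1.
# The divisor is (2 * 160) / (2 * 1) = 160, and 1440 = 9 * 160, so the check passes.
assert validate_gbs(gbs=1440, micro_batch_size=2, num_gpus=160,
                    tensor_parallelism=2, pipeline_parallelism=1)
```

Running a check like this before resuming on a new node count confirms that the chosen GBS remains valid for the updated GPU total.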