Resuming Training with a Different Number of Nodes


If you need to resume a training run with a different number of nodes, NVIDIA recommends that you keep the global batch size (GBS) unchanged. This ensures that each training step remains nearly identical, regardless of the number of nodes. The number of nodes you select must be compatible with the other training parameters: GBS must be a multiple of:

(micro_batch_size × num_gpus) / (tensor_parallelism × pipeline_parallelism)

For example, consider a case where:

  • GBS is 1440, the default value for the 5B GPT model

  • MBS is 2

  • The number of GPUs is 160 (20 nodes × 8 GPUs per node)

  • The tensor_parallelism configuration is set to 2

  • The pipeline_parallelism configuration is set to 1

GBS must be a multiple of (2 × 160) / (2 × 1) = 160. Since 1440 = 9 × 160, the GBS value of 1440 is valid.
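
The check reduces to simple integer arithmetic. The following Python sketch shows one way to validate a candidate GBS against the constraint above; the helper name is hypothetical and not part of NeMo:

    def is_valid_gbs(gbs, micro_batch_size, num_gpus,
                     tensor_parallelism, pipeline_parallelism):
        """Return True if gbs is a multiple of
        (micro_batch_size * num_gpus) / (tensor_parallelism * pipeline_parallelism)."""
        model_parallel_size = tensor_parallelism * pipeline_parallelism
        # The GPUs must divide evenly into model-parallel groups.
        if num_gpus % model_parallel_size != 0:
            return False
        data_parallel_size = num_gpus // model_parallel_size
        # Smallest valid GBS increment: one micro-batch per data-parallel replica.
        min_gbs = micro_batch_size * data_parallel_size
        return gbs % min_gbs == 0

    # Values from the example above: 20 nodes x 8 GPUs, MBS = 2, TP = 2, PP = 1
    print(is_valid_gbs(1440, 2, 20 * 8, 2, 1))  # True, because 1440 = 9 x 160

Running this check before relaunching a job on a new node count confirms that the unchanged GBS still divides evenly into micro-batches across the data-parallel replicas.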
