Resuming Training with a Different Number of Nodes

If you need to resume a training run with a different number of nodes, NVIDIA recommends that you keep the global batch size (GBS) unchanged. This ensures that each training step remains almost identical, regardless of the number of nodes. The number of nodes you select must be compatible with the other parameters. Specifically, GBS must be a multiple of:

(micro_batch_size × num_gpus) / (tensor_parallelism × pipeline_parallelism)

For example, consider a case where:

  • GBS is 1440, its default value for the 5B GPT model

  • The micro batch size (MBS) is 2

  • The number of GPUs is 20 × 8 = 160 (20 nodes with 8 GPUs each)

  • The tensor_parallelism configuration is set to 2

  • The pipeline_parallelism configuration is set to 1

GBS must be set to a multiple of (2 × 160) / (2 × 1) = 160. Because 1440 = 9 × 160, the value of GBS is valid.
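
The same check can be scripted before you resubmit a job. The following is a minimal sketch, not part of the NeMo Framework: the function names (gbs_divisor, is_valid_gbs, compatible_node_counts) are hypothetical helpers, and the code assumes that the total number of GPUs must divide evenly by tensor_parallelism × pipeline_parallelism.

```python
def gbs_divisor(micro_batch_size, num_gpus, tensor_parallelism, pipeline_parallelism):
    """Return the value that GBS must be a multiple of.

    Implements (micro_batch_size * num_gpus) / (tensor_parallelism * pipeline_parallelism).
    Assumption: num_gpus is evenly divisible by the model-parallel size.
    """
    model_parallel_size = tensor_parallelism * pipeline_parallelism
    if num_gpus % model_parallel_size != 0:
        raise ValueError(
            "num_gpus must be divisible by tensor_parallelism * pipeline_parallelism"
        )
    return micro_batch_size * (num_gpus // model_parallel_size)


def is_valid_gbs(global_batch_size, micro_batch_size, num_gpus,
                 tensor_parallelism, pipeline_parallelism):
    """Return True if GBS is a multiple of the required divisor."""
    divisor = gbs_divisor(micro_batch_size, num_gpus,
                          tensor_parallelism, pipeline_parallelism)
    return global_batch_size % divisor == 0


def compatible_node_counts(global_batch_size, micro_batch_size, gpus_per_node,
                           tensor_parallelism, pipeline_parallelism, max_nodes):
    """List node counts (1..max_nodes) that keep the given GBS valid."""
    model_parallel_size = tensor_parallelism * pipeline_parallelism
    counts = []
    for nodes in range(1, max_nodes + 1):
        num_gpus = nodes * gpus_per_node
        if num_gpus % model_parallel_size != 0:
            continue  # GPUs cannot be split evenly across model-parallel groups
        if is_valid_gbs(global_batch_size, micro_batch_size, num_gpus,
                        tensor_parallelism, pipeline_parallelism):
            counts.append(nodes)
    return counts


if __name__ == "__main__":
    # Values from the example above: GBS 1440, MBS 2, 20 nodes of 8 GPUs, TP 2, PP 1.
    print(gbs_divisor(2, 160, 2, 1))            # 160
    print(is_valid_gbs(1440, 2, 160, 2, 1))     # True
    print(compatible_node_counts(1440, 2, 8, 2, 1, max_nodes=32))
```

With the example values, 16 nodes (128 GPUs) would not be compatible, because the divisor becomes 128 and 1440 is not a multiple of 128, whereas 18 or 30 nodes would be.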

© Copyright 2023-2024, NVIDIA. Last updated on Apr 25, 2024.