Batching#

Batch size is one of the first hyperparameters you should tune. For both efficiency and convergence reasons, we recommend first maximizing the batch size per GPU so that GPU memory is fully utilized.

NeMo Megatron uses the following concepts.

Micro batch size is the number of examples per data-parallel rank. It is controlled by the model.micro_batch_size parameter.

Global batch size = micro_batch_size * data_parallel_size * gradient_accumulation_steps. For details on data_parallel_size, see the Parallelisms section; typically it is equal to the number of GPUs being used. Global batch size is controlled by the model.global_batch_size parameter.
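As a worked example of this relationship, the snippet below derives the number of gradient accumulation steps from the other three quantities. The variable names mirror the parameters above but are illustrative only, not NeMo API identifiers.

```python
micro_batch_size = 4       # model.micro_batch_size
data_parallel_size = 8     # typically the number of GPUs (see Parallelisms)
global_batch_size = 256    # model.global_batch_size

# The global batch size must be divisible by micro_batch_size * data_parallel_size;
# the quotient is the number of gradient accumulation steps.
assert global_batch_size % (micro_batch_size * data_parallel_size) == 0
gradient_accumulation_steps = global_batch_size // (micro_batch_size * data_parallel_size)
print(gradient_accumulation_steps)  # 8
```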

Gradient Accumulation

  • Idea: Train with a large effective batch size at a fixed memory footprint, at the cost of additional compute.

  • Run k forward and backward passes through the network with different micro-batches, accumulating gradients without performing a parameter update.

  • After k passes, update the parameters once using the accumulated gradients (see the sketch after this list).
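To make the pattern concrete, here is a minimal PyTorch sketch of gradient accumulation. The model, optimizer, loss function, and data are placeholders for illustration; NeMo performs this accumulation internally based on the batch size parameters described above.

```python
import torch

k = 8  # accumulation steps; effective batch = k * micro-batch size

# Placeholder model and data for illustration only.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
dataloader = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(32)]

optimizer.zero_grad()
for step, (x, y) in enumerate(dataloader, start=1):
    loss = loss_fn(model(x), y)
    # Scale the loss so the accumulated gradient matches one large-batch gradient.
    (loss / k).backward()     # gradients accumulate in .grad across passes
    if step % k == 0:
        optimizer.step()      # single parameter update after k micro-batches
        optimizer.zero_grad()
```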