NVIDIA Megatron-Core
Developer Guide (Latest)

distributed package

This package contains utilities to finalize model weight gradients on each rank before the optimizer step. It includes a distributed data parallelism wrapper that all-reduces or reduce-scatters the gradients across data-parallel replicas, and a finalize_model_grads function that synchronizes gradients across different parallelism modes (e.g., ‘tied’ layers on different pipeline stages, or gradients for experts in a MoE on different ranks due to expert parallelism).

Model wrapper for distributed data parallelism. Stores gradients in a contiguous buffer, and supports overlapping communication (all-reduce or reduce-scatter) with backprop computation by breaking up the full model’s gradients into smaller buckets and running all-reduce / reduce-scatter on each bucket asynchronously.
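The contiguous-buffer and bucketing idea can be pictured with a small pure-Python sketch. The function build_buckets, the parameter names, and the size threshold are hypothetical and chosen for illustration; the actual bucketing policy in Megatron-Core may order and size buckets differently.

```python
from typing import Dict, List, Tuple

def build_buckets(
    param_sizes: Dict[str, int], bucket_size: int
) -> List[Tuple[int, int, List[str]]]:
    """Lay parameters out back to back in one flat buffer and group them.

    Returns one (start, end, names) tuple per bucket, where start/end are
    offsets into the flat gradient buffer. A bucket is closed as soon as it
    holds at least `bucket_size` elements (an assumed, simplified policy).
    """
    buckets = []
    start = offset = 0
    names: List[str] = []
    for name, size in param_sizes.items():
        names.append(name)
        offset += size
        if offset - start >= bucket_size:
            buckets.append((start, offset, names))
            start, names = offset, []
    if names:  # leftover parameters form a final, smaller bucket
        buckets.append((start, offset, names))
    return buckets

# Hypothetical parameter sizes (number of elements per parameter).
sizes = {"w1": 4, "b1": 2, "w2": 6, "b2": 1}
grad_buffer = [0.0] * sum(sizes.values())  # one contiguous gradient buffer
buckets = build_buckets(sizes, bucket_size=6)
```

Each bucket is a slice of the same flat buffer, so a communication call can operate on one slice while backprop is still filling the next.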

class core.distributed.distributed_data_parallel.DistributedDataParallel(*args: Any, **kwargs: Any)

Bases: core.transformer.module.MegatronModule

DDP wrapper which stores grads in contiguous buffers. Also has the option of overlapping communication with backprop computation by breaking up the full model’s gradients into smaller buckets and running all-reduce / reduce-scatter on each bucket asynchronously. This class also provides the option to do the gradient accumulation in a type other than the param type (e.g., fp32 for a bf16 model).
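Why accumulating in a wider type than the param type can matter is sketched below in pure Python. The helpers to_bf16, to_fp32, and accumulate are hypothetical; bf16 is emulated by truncating the float32 mantissa, which is an assumption made for simplicity (real hardware typically rounds to nearest).

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 pattern."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest float32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def accumulate(grads, cast):
    """Sum grads, rounding the running total to the accumulation type each step."""
    total = cast(0.0)
    for g in grads:
        total = cast(total + g)
    return total

# 1000 identical small gradient contributions, already stored in bf16.
micro_grads = [to_bf16(0.001)] * 1000
sum_in_bf16 = accumulate(micro_grads, to_bf16)
sum_in_fp32 = accumulate(micro_grads, to_fp32)
```

In this toy setup the bf16 running total stops growing once a single contribution is smaller than the spacing between representable values, while the fp32 total stays close to the exact sum.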

Parameters
  • config – Transformer config object.

  • module – Underlying model.

  • data_parallel_group – Data-parallel process group.

  • accumulate_allreduce_grads_in_fp32 – If true, do the gradient accumulation and communication in fp32.

  • overlap_grad_reduce – If true, overlap communication with backprop computation by breaking up grads into buckets. If false, a single synchronous communication call is used instead.

  • use_distributed_optimizer – If true, issue reduce-scatter communication calls as part of distributed optimizer. If false, issue all-reduce communication calls.

  • disable_bucketing – If true, force-assign all parameters to a single bucket. If false, use the standard bucketing policy: assign parameters to smaller buckets and all-reduce per bucket if overlap_grad_reduce is True and pp_rank is 0.

  • check_for_nan_in_grad – If true, check if local grad norm is NaN.
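The difference between the two collectives selected by use_distributed_optimizer can be sketched with plain lists standing in for per-rank gradient buffers. The functions below are illustrative stand-ins, not library calls, and they use a plain sum; whether the real wrapper sums or averages is not shown here.

```python
from typing import List

def all_reduce(grads_per_rank: List[List[float]]) -> List[List[float]]:
    """Every rank ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*grads_per_rank)]
    return [list(total) for _ in grads_per_rank]

def reduce_scatter(grads_per_rank: List[List[float]]) -> List[List[float]]:
    """Rank r ends up with only the r-th equal shard of the element-wise sum."""
    world_size = len(grads_per_rank)
    total = [sum(vals) for vals in zip(*grads_per_rank)]
    shard = len(total) // world_size  # assumes length divisible by world size
    return [total[r * shard:(r + 1) * shard] for r in range(world_size)]

# Two simulated data-parallel ranks, each with a 4-element gradient buffer.
grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
after_all_reduce = all_reduce(grads)
after_reduce_scatter = reduce_scatter(grads)
```

With reduce-scatter each rank holds only its own shard of the reduced gradients, which is the piece a sharded optimizer would need on that rank.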

broadcast_params()

Syncs parameters across all DP ranks.

finish_grad_sync()

Finishes grad sync (all-reduce or reduce-scatter) communication operations for all model gradients.

When overlap_grad_reduce is set to True, waits for asynchronous communication calls to complete. When overlap_grad_reduce is set to False, calls synchronous communication ops.

forward(*inputs, **kwargs)

Calls the wrapped module’s forward() method.

load_state_dict(state_dict, strict=True)

Copies parameters and buffers from state_dict into the wrapped module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.
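The strict key check can be pictured with a small stand-alone sketch. check_keys is a hypothetical helper that mirrors the described rule; it is not part of Megatron-Core or PyTorch, and the key names are made up.

```python
from typing import Iterable, List, Tuple

def check_keys(
    expected: Iterable[str], provided: Iterable[str], strict: bool = True
) -> Tuple[List[str], List[str]]:
    """Compare the module's own keys with the keys of an incoming state_dict.

    Returns (missing, unexpected). With strict=True any mismatch is an error.
    """
    missing = sorted(set(expected) - set(provided))
    unexpected = sorted(set(provided) - set(expected))
    if strict and (missing or unexpected):
        raise RuntimeError(f"missing={missing}, unexpected={unexpected}")
    return missing, unexpected

module_keys = ["linear.weight", "linear.bias"]
checkpoint_keys = ["linear.weight", "extra.scale"]

# Non-strict: mismatches are reported but tolerated.
missing, unexpected = check_keys(module_keys, checkpoint_keys, strict=False)

# Strict: the same mismatch raises.
try:
    check_keys(module_keys, checkpoint_keys, strict=True)
    strict_failed = False
except RuntimeError:
    strict_failed = True
```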

no_sync()

Context manager that turns off gradient synchronization.
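A common use of such a context manager is gradient accumulation: skip synchronization for all but the last microbatch. The sketch below uses a hypothetical ToyDDP class with a made-up backward method to show the pattern; it does not call the real wrapper.

```python
from contextlib import contextmanager

class ToyDDP:
    """Toy stand-in that counts how often gradients would be synchronized."""

    def __init__(self):
        self.sync_enabled = True
        self.local_grad = 0.0
        self.sync_calls = 0

    @contextmanager
    def no_sync(self):
        # Disable synchronization inside the block, then restore the old state.
        previous = self.sync_enabled
        self.sync_enabled = False
        try:
            yield
        finally:
            self.sync_enabled = previous

    def backward(self, grad: float) -> None:
        self.local_grad += grad  # local accumulation always happens
        if self.sync_enabled:
            self.sync_calls += 1  # stands in for an all-reduce / reduce-scatter

model = ToyDDP()
microbatch_grads = [0.5, 0.25, 0.25]

with model.no_sync():
    for g in microbatch_grads[:-1]:
        model.backward(g)             # accumulate locally, no communication
model.backward(microbatch_grads[-1])  # last microbatch triggers the sync
```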

start_grad_sync(*unused)

Initiates grad sync (all-reduce or reduce-scatter) communication operations for all model gradients.

When overlap_grad_reduce is set to True, dispatches asynchronous communication calls. When overlap_grad_reduce is set to False, calls synchronous communication ops.
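The start/finish split can be illustrated with a toy class in which a background thread stands in for an asynchronous communication call. ToyGradSync and its internals are hypothetical; only the two method names and the overlap_grad_reduce flag come from the text above.

```python
from concurrent.futures import ThreadPoolExecutor

class ToyGradSync:
    """Toy model of the start/finish protocol over simulated ranks."""

    def __init__(self, grads_per_rank, overlap_grad_reduce: bool):
        self.grads_per_rank = grads_per_rank
        self.overlap_grad_reduce = overlap_grad_reduce
        self.executor = ThreadPoolExecutor(max_workers=1)
        self.handle = None
        self.reduced = None

    def _reduce(self):
        # Stand-in for the collective: element-wise sum over ranks.
        return [sum(vals) for vals in zip(*self.grads_per_rank)]

    def start_grad_sync(self):
        if self.overlap_grad_reduce:
            # Dispatch asynchronously and keep a handle to wait on later.
            self.handle = self.executor.submit(self._reduce)
        else:
            self.reduced = self._reduce()  # blocking call

    def finish_grad_sync(self):
        if self.overlap_grad_reduce:
            self.reduced = self.handle.result()  # wait for completion
            self.handle = None
        elif self.reduced is None:
            self.start_grad_sync()  # nothing in flight, so reduce now
        self.executor.shutdown()

grads = [[1.0, 2.0], [3.0, 4.0]]

overlapped = ToyGradSync(grads, overlap_grad_reduce=True)
overlapped.start_grad_sync()   # remaining backprop work could run here
overlapped.finish_grad_sync()

blocking = ToyGradSync(grads, overlap_grad_reduce=False)
blocking.finish_grad_sync()
```

Both modes end with the same reduced gradients; the overlapped mode only changes when the work is issued relative to the rest of the backward pass.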

state_dict(prefix='', keep_vars=False)

Returns a dictionary containing references to the whole state of the wrapped module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

state_dict_for_save_checkpoint(prefix='', keep_vars=False)

Returns wrapped module’s state_dict for checkpoint saving.

zero_grad_buffer(zero_buffer)

Zeros out all grad buffers. Needs to be called at the beginning of each training iteration.

When zero_buffer is set to True, the underlying grad buffer is zeroed out.

Finalize model gradients for optimizer step across all used parallelism modes. Synchronizes the all-reduce / reduce-scatter of model gradients across DP replicas, all-reduces the layernorm gradients for sequence parallelism, embedding gradients across first and last pipeline stages (if not tied), and expert gradients for expert parallelism.

core.distributed.finalize_model_grads.finalize_model_grads(model: List[torch.nn.Module])

All-reduce all model grads across DP replicas, layernorm grads for sequence parallelism, embedding grads across first and last pipeline stages (if not tied).
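One of these steps, the embedding all-reduce across the first and last pipeline stages, can be pictured with a toy example. The sketch assumes each of the two stages holds its own copy of a shared embedding weight and sees a different local gradient; allreduce_embedding_grads and sgd_step are hypothetical helpers, not the library's functions.

```python
from typing import List, Tuple

def allreduce_embedding_grads(
    first_stage_grad: List[float], last_stage_grad: List[float]
) -> Tuple[List[float], List[float]]:
    """Sum the two stages' gradients so both copies see the same total."""
    total = [a + b for a, b in zip(first_stage_grad, last_stage_grad)]
    return list(total), list(total)

def sgd_step(weight: List[float], grad: List[float], lr: float) -> List[float]:
    """Plain SGD update, used only to show the copies stay identical."""
    return [w - lr * g for w, g in zip(weight, grad)]

# Two copies of the same weight, one per pipeline stage (assumed setup).
weight_first = [1.0, 2.0]
weight_last = [1.0, 2.0]

# Each stage computes a different local gradient for its copy.
grad_first, grad_last = [0.5, 0.0], [0.0, 1.0]
grad_first, grad_last = allreduce_embedding_grads(grad_first, grad_last)

weight_first = sgd_step(weight_first, grad_first, lr=0.5)
weight_last = sgd_step(weight_last, grad_last, lr=0.5)
```

Without the all-reduce, the two copies would receive different updates and drift apart after the first optimizer step.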

Contains functionality to synchronize gradients across different ranks before optimizer step.

© Copyright 2022-2024, NVIDIA. Last updated on Mar 16, 2024.