core.distributed.distributed_data_parallel_config#

Module Contents#

Classes#

DistributedDataParallelConfig

Configuration for DistributedDataParallel.

API#

class core.distributed.distributed_data_parallel_config.DistributedDataParallelConfig#

Configuration for DistributedDataParallel.
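
A minimal construction sketch (the import path below assumes the class is re-exported from `megatron.core.distributed`; adjust it to your installation if it differs):

```python
# Sketch: construct the config with defaults and inspect a few fields.
# The import path is an assumption; field names come from this page.
from megatron.core.distributed import DistributedDataParallelConfig

ddp_config = DistributedDataParallelConfig()
print(ddp_config.grad_reduce_in_fp32)        # False
print(ddp_config.use_distributed_optimizer)  # False
print(ddp_config.bucket_size)                # None (MCore picks a default)
```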

grad_reduce_in_fp32: bool#

False

If true, reduce grads in fp32.

overlap_grad_reduce: bool#

False

If true, overlap grad all-reduce / reduce-scatter with backward compute.

overlap_param_gather: bool#

False

If true, overlap param all-gather with forward compute.

align_param_gather: bool#

False

If true, all PP stages will launch param all-gathers simultaneously. Otherwise, each PP stage will independently launch as needed.

use_distributed_optimizer: bool#

False

If true, issue reduce-scatter collectives to aggregate gradients and clean up originally allocated model parameters, otherwise issue all-reduce collectives.
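
For example, a common high-throughput setup (a sketch only; whether these flags are appropriate depends on your model and cluster) enables the distributed optimizer together with both overlap flags:

```python
# Sketch: distributed optimizer with grad-reduce and param-gather overlap.
# Field names come from this page; the import path is an assumption.
from megatron.core.distributed import DistributedDataParallelConfig

ddp_config = DistributedDataParallelConfig(
    use_distributed_optimizer=True,   # reduce-scatter grads instead of all-reduce
    overlap_grad_reduce=True,         # overlap grad reduce-scatter with backward
    overlap_param_gather=True,        # overlap param all-gather with forward
    grad_reduce_in_fp32=True,         # reduce grads in fp32
)
```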

num_distributed_optimizer_instances: int#

1

Number of distributed-optimizer instances the data-parallel domain is partitioned into (i.e., the sharding factor for partial DistOpt). Defaults to 1, which means the distributed optimizer spans the entire DP domain.

check_for_nan_in_grad: bool#

False

If true, check for NaNs and Infs in gradients before communication collective.

check_for_large_grads: bool#

False

If true, check for unexpectedly large gradients before communication collective.

bucket_size: Optional[int]#

None

Maximum number of parameters in each bucket. If unspecified, MCore uses a default value of max(40000000, 1000000 * dp_size) parameters (larger DP sizes need larger buckets to ensure collectives do not become latency-bound).
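
The default can be reproduced directly from the formula above; a small worked example:

```python
# Worked example of the default bucket size described above.
def default_bucket_size(dp_size: int) -> int:
    return max(40_000_000, 1_000_000 * dp_size)

print(default_bucket_size(8))    # 40000000  (small DP: the 40M floor dominates)
print(default_bucket_size(128))  # 128000000 (large DP: scales with dp_size)
```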

pad_buckets_for_high_nccl_busbw: bool#

False

If true, pad each bucket so its size is divisible by a large power of 2 (2^16). This keeps NCCL collectives at high bus bandwidth at large DP counts, since the NCCL message size (which for ring algorithms is bucket_size / dp_size) needs to be divisible by a power of 2 to achieve high busbw.
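
A sketch of the padding arithmetic implied above (the exact rounding MCore applies may differ; this only illustrates making the bucket size divisible by 2^16):

```python
# Illustrative only: round a bucket size up to a multiple of 2**16 so that
# bucket_size / dp_size stays power-of-2-friendly for NCCL ring collectives.
PAD = 2 ** 16

def pad_bucket_size(bucket_size: int) -> int:
    return ((bucket_size + PAD - 1) // PAD) * PAD

print(pad_bucket_size(40_000_000))  # 40042496, divisible by 65536
```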

reduce_scatter_with_fp32_accumulation: bool#

False

If true, use a reduce-scatter implementation which sends lower-precision values over the wire (using an all-to-all to keep total communication overhead in line with the standard ring implementation) but performs accumulation locally in FP32.

average_in_collective: bool#

False

If true, compute the average directly in the collective, as opposed to dividing by dp_size on each rank first and then computing the sum in the collective.
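
The two orderings are mathematically equivalent but differ in where the division happens; a tiny illustrative sketch of the distinction:

```python
# Illustrative only: "average in collective" vs. pre-divide-then-sum.
dp_size = 4
grads = [1.0, 2.0, 3.0, 4.0]  # the same gradient element on each DP rank

# average_in_collective=False: each rank divides by dp_size, the collective sums.
pre_divided_sum = sum(g / dp_size for g in grads)

# average_in_collective=True: the collective averages directly
# (e.g. an AVG reduce op), with no per-rank division beforehand.
direct_average = sum(grads) / dp_size

print(pre_divided_sum, direct_average)  # 2.5 2.5
```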

fp8_param_gather: bool#

False

If true, keep the compute param in fp8 (do not use any other intermediate dtype) and perform the param all-gather in fp8.

reuse_grad_buf_for_mxfp8_param_ag: bool#

False

If true, reuse the grad buffer for param AG when using mxfp8 recipe. Should be set to True only when fp8_recipe is mxfp8 and fp8_param_gather is True.

use_megatron_fsdp: bool#

False

If true, use the FSDP code path for DDP.

use_custom_fsdp: bool#

False

NOTE: The flag use_custom_fsdp is deprecated and will be removed in a future release. Please use use_megatron_fsdp instead, as all functionality is being migrated there.

data_parallel_sharding_strategy: str#

‘no_shard’

Sharding strategy for FSDP. Valid values are ‘no_shard’, ‘optim’, ‘optim_grads’, ‘optim_grads_params’.
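
A sketch combining the FSDP-related fields on this page (the strategy value is taken from the valid options listed above; the import path is an assumption):

```python
# Sketch: Megatron FSDP with optimizer state, grads, and params all sharded.
from megatron.core.distributed import DistributedDataParallelConfig

fsdp_config = DistributedDataParallelConfig(
    use_megatron_fsdp=True,
    data_parallel_sharding_strategy="optim_grads_params",  # shard everything
    overlap_param_gather=True,                             # prefetch params in forward
    overlap_grad_reduce=True,
)
```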

gradient_reduce_div_fusion: bool#

True

If true, perform gradient reduce and division fusion.

suggested_communication_unit_size: int#

None

Specifies the number of elements to communicate at once during FSDP (Fully Sharded Data Parallel) operations. This flag also affects FSDP all-gather prefetch behavior. Setting a larger value increases the communication buffer size, while a smaller value disables prefetching and may degrade performance. Adjust this value based on your system’s memory and performance requirements.

preserve_fp32_weights: bool#

True

If true, preserve fp32 weights in the Megatron FSDP ParamAndGradBuffer.

keep_fp8_transpose_cache: bool#

False

If true, keep the fp8 transpose cache when using Megatron FSDP.

nccl_ub: bool#

False

If true, allocate and register NCCL user buffers for the param and grad buffers. This flag enables an SM-efficient NCCL algorithm that can improve the performance of FSDP and DP with comm_overlap, and it is much more effective when used together with SHARP. The table below lists the expected SM usage in various cases. (Note that these are reference numbers only; actual SM usage can vary with message size, communication-domain size, and NCCL version.)

| Communication domain | use_sharp | SM usage of "AG/RS" |
|----------------------|-----------|---------------------|
| NVL                  | N/A       | 4 / 5               |
| NVL+IB               | False     | 16 / 16             |
| NVL+IB               | True      | 6 / 6               |
| IB                   | False     | 1 / 4               |
| IB                   | True      | 1 / 1               |


fsdp_double_buffer: bool#

False

If true, use persistently allocated double buffers for the temporary memory needed by Megatron FSDP communications. This option adds memory overhead, but it is required for user-buffer registration (nccl_ub=True) with Megatron FSDP. It is automatically set to True when nccl_ub=True.
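
A sketch of enabling NCCL user-buffer registration together with the double buffer it requires (as noted above, fsdp_double_buffer is forced to True anyway when nccl_ub=True):

```python
# Sketch: NCCL user-buffer registration for Megatron FSDP communications.
from megatron.core.distributed import DistributedDataParallelConfig

ub_config = DistributedDataParallelConfig(
    use_megatron_fsdp=True,
    data_parallel_sharding_strategy="optim_grads_params",
    nccl_ub=True,             # register param/grad buffers with NCCL
    fsdp_double_buffer=True,  # persistent double buffers (required for nccl_ub)
)
```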

outer_dp_sharding_strategy: str#

‘no_shard’

Sharding strategy for outer data parallel group in Hybrid Sharded Data Parallel (HSDP) mode. Valid values are ‘no_shard’, ‘optim’, ‘optim_grads’, ‘optim_grads_params’. This option is only effective when Hybrid FSDP is enabled.

disable_symmetric_registration: bool#

False

If true, disable symmetric (window) registration for the NCCL user buffer. This forces conventional (local) user-buffer registration when nccl_ub is set.

delay_wgrad_compute: bool#

False

If true, delay the weight-gradient computation to improve batch-level communication overlap.

__post_init__()#