core.model_parallel_config#
Module Contents#
Classes#
ModelParallelConfig: Base configuration for Megatron Core
API#
- class core.model_parallel_config.ModelParallelConfig#
Base configuration for Megatron Core
The initialization function has an argument for each parameter.
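For orientation, here is a minimal sketch of constructing the config. The import path assumes the usual `megatron` package root, and all field values are illustrative, not recommendations:

```python
import torch

from megatron.core.model_parallel_config import ModelParallelConfig

# Illustrative values: 2-way tensor parallelism, 4-way pipeline parallelism, bf16.
config = ModelParallelConfig(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=4,
    bf16=True,
    params_dtype=torch.bfloat16,
    pipeline_dtype=torch.bfloat16,  # p2p communication dtype; usually params_dtype
)
```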
- tensor_model_parallel_size: int#
1
Intra-layer model parallelism. Splits tensors across GPU ranks.
- pipeline_model_parallel_comm_backend: Optional[str]#
None
Configures the backend for pipeline-parallel communication (e.g., nccl, ucc). If None, the default backend is used.
- pipeline_model_parallel_size: int#
1
Inter-layer model parallelism. Splits transformer layers across GPU ranks.
- virtual_pipeline_model_parallel_size: Optional[int]#
None
Interleaved pipeline parallelism is used to improve performance by reducing the pipeline bubble. Considers a transformer block as a list of smaller transformer (virtual) blocks. The number of virtual blocks per pipeline model parallel rank is the virtual model parallel size. See Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM: arxiv.org/pdf/2104.04473.pdf for more details.
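As a quick illustration of the resulting layer split (all numbers are hypothetical):

```python
# Hypothetical model: 32 transformer layers, PP=4, VP=2.
num_layers = 32
pipeline_model_parallel_size = 4
virtual_pipeline_model_parallel_size = 2

# Each pipeline rank holds VP virtual blocks; each virtual block then holds:
layers_per_virtual_block = num_layers // (
    pipeline_model_parallel_size * virtual_pipeline_model_parallel_size
)
assert layers_per_virtual_block == 4
```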
- sequence_parallel: bool#
False
Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout along the sequence dimension. See Reducing Activation Recomputation in Large Transformer Models (https://arxiv.org/abs/2205.05198) for more details.
- context_parallel_size: int#
1
Splits network input along sequence dimension across GPU ranks.
- hierarchical_context_parallel_sizes: Optional[list[int]]#
None
Degrees of hierarchical context parallelism. Users should provide a list specifying the group size at each level. Taking the a2a+p2p cp communication type as an example: it contains two levels of groups, so the first value in the list indicates the group size of the a2a communication type, and the second value indicates the group size of the p2p communication type.
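For example, a hedged configuration for the a2a+p2p type with a total context-parallel size of 8 might look like this (values illustrative):

```python
# Total context-parallel size is the product of the per-level group sizes: 2 * 4 = 8.
config.context_parallel_size = 8
config.hierarchical_context_parallel_sizes = [2, 4]  # [a2a group size, p2p group size]
```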
- expert_model_parallel_size: int#
1
Distributes MoE experts across the sub-data-parallel dimension.
- expert_tensor_parallel_size: Optional[int]#
None
Intra-layer tensor model parallelism for the expert layers. Splits tensors across GPU ranks.
- moe_extended_tp: bool#
False
NOTE: Deprecated from MCore v0.10. This flag is ignored. Its functionality is replaced by expert_tensor_parallel_size.
- perform_initialization: bool#
True
If true, weights are initialized. This option can be useful when you know you are going to load values from a checkpoint.
- use_cpu_initialization: bool#
False
When set to False, we initialize the weights directly on the GPU. CPU initialization is the same regardless of tensor model parallelism, but GPU initialization is not. Transferring weights from CPU to GPU can take a significant amount of time for large models.
- fp16: bool#
False
If true, use fp16 mixed precision training.
- bf16: bool#
False
If true, use bf16 mixed precision training.
- params_dtype: torch.dtype#
None
dtype used when initializing the weights.
- timers: Optional[Callable]#
None
Timers object to call for various timing functions. See megatron.core.timers.Timers
- finalize_model_grads_func: Optional[Callable]#
None
Function that finalizes gradients on all workers. Could include ensuring that grads are all-reduced across data parallelism, pipeline parallelism, and sequence parallelism dimensions.
- grad_scale_func: Optional[Callable]#
None
If using loss scaling, this function should take the loss and return the scaled loss. If None, no function is called on the loss.
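A minimal sketch, assuming simple static loss scaling (the scale factor is illustrative):

```python
# Scale the loss before backward; the optimizer must unscale gradients to match.
loss_scale = 2.0 ** 12
config.grad_scale_func = lambda loss: loss * loss_scale
```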
- no_sync_func: Optional[Callable]#
None
Function that creates a context that suppresses asynchronous data-parallel communication. If the model is an instance of core.distributed.DistributedDataParallel, the default is to use core.distributed.DistributedDataParallel.no_sync.
- grad_sync_func: Optional[Callable]#
None
Function that launches asynchronous gradient reductions (e.g. distributed optimizer gradient reduce-scatters). The function should take one argument: an iterable of parameters whose gradients are to be synchronized.
- param_sync_func: Optional[Callable]#
None
Function that launches asynchronous parameter synchronizations (e.g. distributed optimizer parameter all-gathers). The function should take one argument: an iterable of parameters to be synchronized.
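A hedged sketch of wiring these three hooks to a data-parallel wrapper. `no_sync` is documented above; the `start_grad_sync`/`start_param_sync` method names are illustrative assumptions, not a confirmed API:

```python
# Sketch only. `ddp_model` is assumed to be a core.distributed.DistributedDataParallel
# instance; the start_* method names below are illustrative.
config.no_sync_func = ddp_model.no_sync  # context manager, as documented above

def grad_sync(params):
    ddp_model.start_grad_sync(params)  # hypothetical: async grad reduction for `params`

def param_sync(params):
    ddp_model.start_param_sync(params)  # hypothetical: async param all-gather for `params`

config.grad_sync_func = grad_sync
config.param_sync_func = param_sync
```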
- deterministic_mode: bool#
False
If true, code that has deterministic execution will be chosen. This usually means slower execution, but is good for debugging and testing. Defaults to False.
- enable_autocast: bool#
False
If true, runs the forward step function inside a torch.autocast context.
- autocast_dtype: Optional[torch.dtype]#
None
dtype to pass to torch.amp.autocast when enabled. If None, is set to pipeline_dtype.
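For example (a sketch; the dtype choice is illustrative):

```python
import torch

# Run the forward step under torch.autocast with bf16. If autocast_dtype were
# left as None, it would fall back to pipeline_dtype as noted above.
config.enable_autocast = True
config.autocast_dtype = torch.bfloat16
```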
- num_microbatches_with_partial_activation_checkpoints: Optional[int]#
None
If int, set the number of microbatches where not all of the layers will be checkpointed and recomputed. The rest of the microbatches within the window of maximum outstanding microbatches will recompute all layers (either full recompute or selective recompute). If None, the checkpoint and recompute will be left up to the forward_step function.
- gradient_accumulation_fusion: bool#
False
If true, fuses weight gradient accumulation into GEMMs. Requires the custom CUDA extension module fused_weight_gradient_mlp_cuda. To use gradient_accumulation_fusion you must install APEX with --cpp_ext and --cuda_ext. For example: pip install --global-option="--cpp_ext" --global-option="--cuda_ext". Note that the extension requires CUDA>=11; otherwise, you must turn off gradient accumulation fusion.
- async_tensor_model_parallel_allreduce: bool#
False
NOTE: Deprecated. This flag is ignored.
- use_te_rng_tracker: bool#
False
If true, uses the RNG state tracker from TransformerEngine, if it exists.
- tp_comm_overlap: bool#
False
If true, allows overlapping of Linear layer execution with tensor parallel communication collectives like AllGather/ReduceScatter. Overlapping is done for the linear layers wherever possible during the forward and the backward pass.
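A hedged sketch of turning on the overlap together with the tp_comm_* sub-flags documented below (all values illustrative):

```python
# Master switch; the tp_comm_* sub-flags below are ignored when this is False.
config.tp_comm_overlap = True
config.tp_comm_bulk_wgrad = True   # All-Gather overlap with bprop activation-grad GEMM
config.tp_comm_bulk_dgrad = True   # Reduce-Scatter overlap with bprop weight-grad GEMM
config.tp_comm_overlap_ag = True   # pipelined GEMM + All-Gather
config.tp_comm_overlap_rs = True   # pipelined GEMM + Reduce-Scatter
```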
- tp_comm_bulk_wgrad: bool#
True
If true, allows All-Gather overlap with the Bprop activation gradient GEMM. Ignored if tp_comm_overlap is False.
- tp_comm_bulk_dgrad: bool#
True
If true, allows Reduce-Scatter overlap with the Bprop weight gradient GEMM. Ignored if tp_comm_overlap is False.
- tp_comm_overlap_ag: bool#
True
If true, allows All-Gather overlap with GEMM by pipelining the GEMM and the All-Gather. Ignored if tp_comm_overlap is False.
- tp_comm_overlap_rs: bool#
True
If true, allows Reduce-Scatter overlap with GEMM by pipelining the GEMM and the Reduce-Scatter. Ignored if tp_comm_overlap is False.
- tp_comm_overlap_rs_dgrad: bool#
False
If true, allows Reduce-Scatter overlap with the DGRAD GEMM by pipelining the GEMM and Reduce-Scatter splits. Ignored if tp_comm_overlap is False.
- tp_comm_split_ag: bool#
True
Deprecated since TransformerEngine v1.6.0. If true, allows All-Gather overlap with the Fprop GEMM by pipelining the GEMM and All-Gather splits. Ignored if tp_comm_overlap is False.
- tp_comm_atomic_ag: bool#
False
Deprecated since TransformerEngine v1.6.0. If true, allows All-Gather overlap with the Fprop GEMM, with both done atomically. Ignored if tp_comm_overlap is False.
- tp_comm_split_rs: bool#
True
Deprecated since TransformerEngine v1.6.0. If true, allows Reduce-Scatter overlap with the Fprop GEMM by pipelining the GEMM and Reduce-Scatter splits. Ignored if tp_comm_overlap is False.
- tp_comm_atomic_rs: bool#
False
Deprecated since TransformerEngine v1.6.0. If true, allows Reduce-Scatter overlap with the Fprop GEMM, with both done atomically. Ignored if tp_comm_overlap is False.
- cross_entropy_loss_fusion: bool#
False
If enabled, the fused cross-entropy implementation is used. Defaults to False.
- cross_entropy_fusion_impl: str#
'native'
If 'native', the MCore-based CE loss fusion is used; if 'te', the parallel CE loss from the Transformer Engine library is used. Defaults to 'native'.
- tp_comm_overlap_disable_qkv: bool#
False
If true, the AllGather -> GEMM overlap for QKV is disabled.
- tp_comm_overlap_disable_fc1: bool#
False
If true, the AllGather -> GEMM overlap for the FC1 layer of the MLP is disabled.
- tp_comm_bootstrap_backend: str#
'nccl'
Sets the bootstrapping backend; one of 'nccl', 'mpi', or 'gloo'.
- overlap_moe_expert_parallel_comm: bool#
False
Overlaps expert-parallel A2A communication with independent computation of different micro-batches during the 1F1B phase of a pipelined schedule, or in a non-pipelined schedule.
- delay_wgrad_compute: bool#
False
Delays the weight gradient computation to improve batch-level communication overlap.
- pipeline_dtype: torch.dtype#
None
dtype used in p2p communication; usually the same as params_dtype.
- variable_seq_lengths: bool#
False
Support for variable sequence lengths across microbatches. Setting this communicates the size of tensors during pipeline-parallelism communication; because of this extra overhead, it should only be set if the sequence length varies by microbatch within a global batch.
- overlap_p2p_comm: bool#
False
When True, some of the peer-to-peer communication for pipeline parallelism will overlap with computation. Must be False if batch_p2p_comm is True.
- batch_p2p_comm: bool#
True
Use batch_isend_irecv instead of individual isend/irecv calls. Must be False if overlap_p2p_comm is True.
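Since these two flags are mutually exclusive, a hedged sketch of enabling overlapped p2p communication:

```python
# overlap_p2p_comm requires batch_p2p_comm to be disabled, per the notes above.
config.overlap_p2p_comm = True
config.batch_p2p_comm = False
```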
- batch_p2p_sync: bool#
True
When using batch_isend_irecv, do a cuda.device.synchronize afterward to work around a bug in older versions of PyTorch.
- use_ring_exchange_p2p: bool#
False
Use custom ring_exchange kernel instead of torch.distributed.batch_isend_irecv(). Requires custom built torch with torch.distributed.ring_exchange.
- deallocate_pipeline_outputs: bool#
False
If True, output data is deallocated after the tensor is sent to the next pipeline stage. Helps save memory; does nothing when pipeline parallelism is not used.
- defer_embedding_wgrad_compute: bool#
False
If true, defers the embedding WGRAD GEMMs while the pipeline flush is taking place, enabling us to hide the pipeline flush latency. Defaults to False.
- wgrad_deferral_limit: int#
0
This value tunes the number of micro-batches for which the embedding weight gradient compute is deferred to the pipeline flush; this argument is ignored if defer_embedding_wgrad_compute is False. Defaults to 0, which means all micro-batches are deferred.
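For example (a sketch; the limit is illustrative):

```python
# Defer embedding weight-gradient GEMMs into the pipeline flush, but for at
# most 2 micro-batches; leaving the limit at 0 would defer all of them.
config.defer_embedding_wgrad_compute = True
config.wgrad_deferral_limit = 2
```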
- overlap_p2p_comm_warmup_flush: bool#
False
If true, overlaps communication and computation in the warmup and flush phases. Only valid when overlap_p2p_comm is True and batch_p2p_comm is False. Defaults to False.
- microbatch_group_size_per_vp_stage: Optional[int]#
None
This value specifies the number of micro-batches that are executed at a time for a given virtual stage (both forward and backward). Defaults (in the __post_init__() method below) to pipeline_parallel_size, which specifies a depth-first schedule. Example: for PP=2, VP=2, with microbatch_group_size_per_vp_stage=2 and num_microbatches=4, we have

rank 0 | 0 1 0 1 2 3 2 3
rank 1 | 0 1 0 1 2 3 2 3

With microbatch_group_size_per_vp_stage=3 and num_microbatches=5, we have

rank 0 | 0 1 2 0 1 2 3 4 3 4
rank 1 | 0 1 2 0 1 2 3 4 3 4
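A hedged sketch matching the second schedule above (values illustrative):

```python
# PP=2, VP=2, 5 microbatches; execute 3 microbatches per virtual stage at a time.
config.pipeline_model_parallel_size = 2
config.virtual_pipeline_model_parallel_size = 2
config.microbatch_group_size_per_vp_stage = 3
```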
- cpu_offloading: bool#
False
When set to True, all the activations are offloaded to the CPU asynchronously.
- cpu_offloading_num_layers: int#
0
Specifies the number of transformer layers whose activations are offloaded.
- _cpu_offloading_context: Optional[ContextManager]#
None
For internal use only, do not set.
- cpu_offloading_activations: bool#
True
If True, offloads the activations to CPU.
- cpu_offloading_weights: bool#
False
If True, offloads the weights to CPU.
- cpu_offloading_double_buffering: bool#
False
If True, enables double buffering across layers while reloading activations from CPU.
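A hedged sketch of an activation-offloading setup (the layer count is illustrative):

```python
# Offload activations of 8 transformer layers to CPU, with double buffering
# to overlap reloads with compute.
config.cpu_offloading = True
config.cpu_offloading_num_layers = 8
config.cpu_offloading_activations = True
config.cpu_offloading_double_buffering = True
```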
- barrier_with_L1_time: bool#
True
If true, use barrier with level 1 time measurements. It is up to the user to ensure that calling a barrier with their timers does not result in hangs; this can happen if, for example, the user adds a level 1 timer that is not called by all ranks.
- __post_init__()#
Python dataclass method that is used to modify attributes after initialization. See https://docs.python.org/3/library/dataclasses.html#post-init-processing for more details.