nemo_automodel.components.distributed.megatron_fsdp

Module Contents

Classes

| Name | Description |
| --- | --- |
| MegatronFSDPManager | Manager for parallelizing models using MegatronFSDP with TP, DP, CP sharding. |

Functions

| Name | Description |
| --- | --- |
| fully_shard_optimizer | - |
| maybe_shard_optimizer | Shard the optimizer with Megatron-FSDP when the strategy requires it. |

Data

HAS_MEGATRON_FSDP

logger

API

class nemo_automodel.components.distributed.megatron_fsdp.MegatronFSDPManager(
config: nemo_automodel.components.distributed.config.MegatronFSDPConfig,
device_mesh: torch.distributed.device_mesh.DeviceMesh
)

Manager for parallelizing models using MegatronFSDP with TP, DP, CP sharding.

This manager applies parallelization to the model using a prescribed TP sharding plan. It supports mixed precision and various FSDP options.

The device mesh must be created externally and passed in.

Parameters:

config
MegatronFSDPConfig

Configuration for MegatronFSDP distributed training.

device_mesh
DeviceMesh

Device mesh for distributed operations.

activation_checkpointing = config.activation_checkpointing
average_in_collective = config.average_in_collective
calculate_per_token_loss = config.calculate_per_token_loss
check_for_nan_in_grad = config.check_for_nan_in_grad
disable_bucketing = config.disable_bucketing
fsdp_double_buffer = config.fsdp_double_buffer
grad_reduce_in_fp32 = config.grad_reduce_in_fp32
init_fsdp_with_meta_device = config.init_fsdp_with_meta_device
keep_fp8_transpose_cache = config.keep_fp8_transpose_cache
megatron_fsdp_unit_modules = config.megatron_fsdp_unit_modules
nccl_ub = config.nccl_ub
overlap_grad_reduce = config.overlap_grad_reduce
overlap_param_gather = config.overlap_param_gather
preserve_fp32_weights = config.preserve_fp32_weights
zero_dp_strategy = config.zero_dp_strategy

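
The attribute list above indicates that each config field is exposed on the manager under the same name. The following is a minimal sketch of that mirroring pattern, not the library's code: `StubConfig` and `StubManager` are hypothetical stand-ins, the field values are arbitrary placeholders rather than library defaults, and storing `config` and `device_mesh` on the instance is an assumption. The real manager may assign each attribute explicitly.

```python
from dataclasses import dataclass, fields


@dataclass
class StubConfig:
    """Hypothetical stand-in for MegatronFSDPConfig (placeholder values)."""

    activation_checkpointing: bool = False
    overlap_grad_reduce: bool = True
    overlap_param_gather: bool = True


class StubManager:
    """Hypothetical stand-in showing config fields mirrored as attributes."""

    def __init__(self, config, device_mesh):
        # Assumption: the manager keeps references to both arguments.
        self.config = config
        self.device_mesh = device_mesh
        # Copy every config field onto the manager under the same name.
        for field in fields(config):
            setattr(self, field.name, getattr(config, field.name))


# device_mesh is passed as None here only because this sketch never uses it.
manager = StubManager(StubConfig(activation_checkpointing=True), device_mesh=None)
print(manager.activation_checkpointing, manager.overlap_grad_reduce)
```

With this pattern, code that holds only the manager can read a setting such as `manager.overlap_grad_reduce` without reaching into `manager.config`.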
nemo_automodel.components.distributed.megatron_fsdp.MegatronFSDPManager.parallelize(
model,
optimizer = None
)

Parallelizes the given model using MegatronFSDP and TP sharding strategies.

Parameters:

model

The model to be parallelized.

optimizer
Defaults to None

The optimizer for the model. If None, the caller must call model.finish_grad_sync() before optimizer.step(), model.install_optimized_model_weights() and model.zero_grad_buffer() after optimizer.zero_grad().

Returns:

(parallelized_model, optimizer)
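
When no optimizer is passed to parallelize, the training loop has to call the synchronization hooks itself. The sketch below shows one ordering consistent with the description of the optimizer parameter, following the order in which the calls are listed there. `FakeShardedModel` and `FakeOptimizer` are hypothetical stand-ins that only record calls; they are not part of nemo_automodel or Megatron-FSDP, and `manual_step` is an illustrative helper name.

```python
class FakeShardedModel:
    """Hypothetical stand-in for a parallelized model; records hook calls."""

    def __init__(self, log):
        self.log = log

    def finish_grad_sync(self):
        self.log.append("model.finish_grad_sync")

    def install_optimized_model_weights(self):
        self.log.append("model.install_optimized_model_weights")

    def zero_grad_buffer(self):
        self.log.append("model.zero_grad_buffer")


class FakeOptimizer:
    """Hypothetical stand-in for an optimizer; records step and zero_grad."""

    def __init__(self, log):
        self.log = log

    def step(self):
        self.log.append("optimizer.step")

    def zero_grad(self):
        self.log.append("optimizer.zero_grad")


def manual_step(model, optimizer):
    """One parameter update with manual hooks, run after the backward pass."""
    model.finish_grad_sync()                 # before optimizer.step()
    optimizer.step()
    optimizer.zero_grad()
    model.install_optimized_model_weights()  # listed after the step
    model.zero_grad_buffer()                 # after optimizer.zero_grad()


calls = []
manual_step(FakeShardedModel(calls), FakeOptimizer(calls))
print(calls)
```

Passing the optimizer to parallelize avoids this bookkeeping, which is why the manual hooks are only required in the None case.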

nemo_automodel.components.distributed.megatron_fsdp.fully_shard_optimizer(
model: torch.nn.Module,
optimizer: torch.optim.Optimizer,
preproc_state_dict_for_dcp_ckpt: bool = True
) -> torch.optim.Optimizer

nemo_automodel.components.distributed.megatron_fsdp.maybe_shard_optimizer(
model_part: torch.nn.Module,
optimizer: torch.optim.Optimizer,
distributed_config: nemo_automodel.components.distributed.config.DistributedConfig | None,
allow: bool = True
) -> torch.optim.Optimizer

Shard the optimizer with Megatron-FSDP when the strategy requires it.

Returns the optimizer unchanged unless distributed_config is a MegatronFSDPConfig and the job is distributed (world size > 1).

Parameters:

model_part
nn.Module

The (already sharded) model part the optimizer belongs to.

optimizer
torch.optim.Optimizer

The optimizer to (optionally) shard.

distributed_config
DistributedConfig | None

Distributed strategy config; sharding is triggered only when it is a MegatronFSDPConfig.

allow
bool
Defaults to True

Guard for optimizers that are incompatible with Megatron-FSDP sharding (e.g. Dion). When False, the function asserts if sharding would otherwise apply.
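
The decision described above can be sketched as a small predicate. This is an illustration of the documented behavior under stated assumptions, not the library's implementation: `StubMegatronFSDPConfig`, `StubOtherConfig`, and `should_shard` are hypothetical names, and the world size is passed in explicitly here, whereas the real function presumably obtains it from the distributed runtime.

```python
from dataclasses import dataclass


@dataclass
class StubMegatronFSDPConfig:
    """Hypothetical stand-in for MegatronFSDPConfig."""


@dataclass
class StubOtherConfig:
    """Hypothetical stand-in for any other distributed strategy config."""


def should_shard(distributed_config, world_size, allow=True):
    """Return True when the optimizer would be sharded.

    Sharding applies only for a Megatron-FSDP config in a job with more
    than one rank. If it applies while allow is False, the guard asserts.
    """
    applies = (
        isinstance(distributed_config, StubMegatronFSDPConfig)
        and world_size > 1
    )
    if applies:
        assert allow, "optimizer is incompatible with Megatron-FSDP sharding"
    return applies


print(should_shard(None, world_size=8))                      # no config
print(should_shard(StubOtherConfig(), world_size=8))         # other strategy
print(should_shard(StubMegatronFSDPConfig(), world_size=1))  # single process
print(should_shard(StubMegatronFSDPConfig(), world_size=8))  # sharding applies
```

Note that `allow=False` has an effect only on the path where sharding would apply; in every other case the optimizer is returned unchanged regardless of the flag.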

nemo_automodel.components.distributed.megatron_fsdp.HAS_MEGATRON_FSDP = True

nemo_automodel.components.distributed.megatron_fsdp.logger = logging.getLogger(__name__)