nemo_automodel.components.distributed.megatron_fsdp

Module Contents

Classes

Name	Description
`MegatronFSDPManager`	Manager for parallelizing models using MegatronFSDP with TP, DP, CP sharding.

Functions

Name	Description
`fully_shard_optimizer`	-
`maybe_shard_optimizer`	Shard the optimizer with Megatron-FSDP when the strategy requires it.

Data

HAS_MEGATRON_FSDP

logger

API

class nemo_automodel.components.distributed.megatron_fsdp.MegatronFSDPManager(
    config: nemo_automodel.components.distributed.config.MegatronFSDPConfig,
    device_mesh: torch.distributed.device_mesh.DeviceMesh
)

Manager for parallelizing models using MegatronFSDP with TP, DP, CP sharding.

This manager applies parallelization to the model using a prescribed TP sharding plan. It supports mixed precision and various FSDP options.

The device mesh must be created externally and passed in.
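
As a rough sketch of that flow, the mesh can be built first and then handed to the manager. The helper names `build_device_mesh` and `parallelize_model` are hypothetical, and the mesh dimension names `("dp", "cp", "tp")` and the `"cuda"` device type are assumptions that this page does not confirm; check what the manager expects before relying on them. Imports are done lazily so the snippet can be loaded without a GPU stack installed.

```python
def build_device_mesh(dp_size, cp_size, tp_size, device_type="cuda"):
    """Create a 3-D device mesh outside the manager (hypothetical helper).

    Must be called after the default process group is initialized, e.g. under torchrun.
    """
    from torch.distributed.device_mesh import init_device_mesh

    # Assumption: the dimension names below are illustrative only.
    return init_device_mesh(
        device_type,
        (dp_size, cp_size, tp_size),
        mesh_dim_names=("dp", "cp", "tp"),
    )


def parallelize_model(config, device_mesh, model, optimizer=None):
    """Hand an externally created mesh to the manager and parallelize the model."""
    from nemo_automodel.components.distributed.megatron_fsdp import MegatronFSDPManager

    # Constructor arguments follow the documented signature: (config, device_mesh).
    manager = MegatronFSDPManager(config, device_mesh)
    # parallelize() is documented to return (parallelized_model, optimizer).
    return manager.parallelize(model, optimizer)
```

Here `config` is expected to be a `MegatronFSDPConfig` instance prepared by the caller; how it is populated is outside the scope of this sketch.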

Parameters:

config

MegatronFSDPConfig

Configuration for MegatronFSDP distributed training.

device_mesh

DeviceMesh

Device mesh for distributed operations.

Attributes:

activation_checkpointing

= config.activation_checkpointing

average_in_collective

= config.average_in_collective

calculate_per_token_loss

= config.calculate_per_token_loss

check_for_nan_in_grad

= config.check_for_nan_in_grad

disable_bucketing

= config.disable_bucketing

fsdp_double_buffer

= config.fsdp_double_buffer

grad_reduce_in_fp32

= config.grad_reduce_in_fp32

init_fsdp_with_meta_device

= config.init_fsdp_with_meta_device

keep_fp8_transpose_cache

= config.keep_fp8_transpose_cache

megatron_fsdp_unit_modules

= config.megatron_fsdp_unit_modules

nccl_ub

= config.nccl_ub

overlap_grad_reduce

= config.overlap_grad_reduce

overlap_param_gather

= config.overlap_param_gather

preserve_fp32_weights

= config.preserve_fp32_weights

zero_dp_strategy

= config.zero_dp_strategy

nemo_automodel.components.distributed.megatron_fsdp.MegatronFSDPManager.parallelize(
    model,
    optimizer = None
)

Parallelizes the given model using MegatronFSDP and TP sharding strategies.

Parameters:

model

The model to be parallelized.

optimizer

Defaults to None

The optimizer for the model. If None, the caller must call model.finish_grad_sync() before optimizer.step(), and model.install_optimized_model_weights() and model.zero_grad_buffer() after optimizer.zero_grad().

Returns:

(parallelized_model, optimizer)
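
When no optimizer is passed, the manual calls described for the optimizer parameter could be arranged as in the sketch below. The `RecordingModel` and `RecordingOptimizer` classes are hypothetical stand-ins that only log method names, so the ordering can be inspected without a distributed job; the relative order of `install_optimized_model_weights()` and `zero_grad_buffer()` follows the order in which they are listed and is an assumption. The backward pass is assumed to have already run.

```python
class RecordingModel:
    """Hypothetical stand-in for the parallelized model; it only logs hook calls."""

    def __init__(self, log):
        self.log = log

    def finish_grad_sync(self):
        self.log.append("model.finish_grad_sync")

    def install_optimized_model_weights(self):
        self.log.append("model.install_optimized_model_weights")

    def zero_grad_buffer(self):
        self.log.append("model.zero_grad_buffer")


class RecordingOptimizer:
    """Hypothetical stand-in for a torch.optim.Optimizer; it only logs calls."""

    def __init__(self, log):
        self.log = log

    def step(self):
        self.log.append("optimizer.step")

    def zero_grad(self):
        self.log.append("optimizer.zero_grad")


def manual_optimizer_update(model, optimizer):
    """One update for the case where parallelize() was called with optimizer=None."""
    model.finish_grad_sync()                 # before optimizer.step()
    optimizer.step()
    optimizer.zero_grad()
    model.install_optimized_model_weights()  # after optimizer.zero_grad()
    model.zero_grad_buffer()                 # after optimizer.zero_grad()


calls = []
manual_optimizer_update(RecordingModel(calls), RecordingOptimizer(calls))
print(calls)
```

In a real training loop the same five calls would be made on the objects returned by `parallelize()` and on the caller's own optimizer.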

nemo_automodel.components.distributed.megatron_fsdp.fully_shard_optimizer(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    preproc_state_dict_for_dcp_ckpt: bool = True
) -> torch.optim.Optimizer

nemo_automodel.components.distributed.megatron_fsdp.maybe_shard_optimizer(
    model_part: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    distributed_config: nemo_automodel.components.distributed.config.DistributedConfig | None,
    allow: bool = True
) -> torch.optim.Optimizer

Shard the optimizer with Megatron-FSDP when the strategy requires it.

Returns the optimizer unchanged unless distributed_config is a MegatronFSDPConfig and the job is distributed (world size > 1).

Parameters:

model_part

nn.Module

The (already sharded) model part the optimizer belongs to.

optimizer

torch.optim.Optimizer

The optimizer to (optionally) shard.

distributed_config

DistributedConfig | None

Distributed strategy config; sharding is triggered only when it is a MegatronFSDPConfig.

allow

bool

Defaults to True

Guard for optimizers incompatible with Megatron-FSDP sharding (e.g. Dion); if False, an assertion fails when sharding would otherwise apply.
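
The decision described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: `MegatronFSDPConfigStub`, `OtherConfigStub`, and `should_shard_optimizer` are hypothetical names, and the world size is passed in explicitly here rather than read from the process group.

```python
class MegatronFSDPConfigStub:
    """Hypothetical stand-in for MegatronFSDPConfig."""


class OtherConfigStub:
    """Hypothetical stand-in for any other distributed strategy config."""


def should_shard_optimizer(distributed_config, world_size, allow=True):
    """Return True when the optimizer would be sharded under the documented rules."""
    applies = isinstance(distributed_config, MegatronFSDPConfigStub) and world_size > 1
    if applies:
        # Guard: some optimizers must never be sharded by Megatron-FSDP.
        assert allow, "optimizer is incompatible with Megatron-FSDP sharding"
    return applies


print(should_shard_optimizer(MegatronFSDPConfigStub(), world_size=8))  # sharding applies
print(should_shard_optimizer(MegatronFSDPConfigStub(), world_size=1))  # single process
print(should_shard_optimizer(OtherConfigStub(), world_size=8))         # other strategy
print(should_shard_optimizer(None, world_size=8))                      # no config
```

With `allow=False` the sketch raises `AssertionError` only in the first case; in the other three the guard is never reached and the optimizer would be returned unchanged.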

nemo_automodel.components.distributed.megatron_fsdp.HAS_MEGATRON_FSDP = True

nemo_automodel.components.distributed.megatron_fsdp.logger = logging.getLogger(__name__)