nemo_automodel.components.distributed.megatron_fsdp#

Module Contents#

Classes#

MegatronFSDPManager

Manager for parallelizing models using MegatronFSDP with TP, DP, CP sharding.

Functions#

Data#

API#

nemo_automodel.components.distributed.megatron_fsdp.logger#

'getLogger(...)'

class nemo_automodel.components.distributed.megatron_fsdp.MegatronFSDPManager(
config: nemo_automodel.components.distributed.config.MegatronFSDPConfig,
device_mesh: torch.distributed.device_mesh.DeviceMesh,
)#

Manager for parallelizing models using MegatronFSDP with TP, DP, CP sharding.

This manager applies parallelization to the model using a prescribed TP sharding plan. It supports mixed precision and various FSDP options.

The device mesh must be created externally and passed in.

Parameters:
  • config (MegatronFSDPConfig) – Configuration for MegatronFSDP distributed training.

  • device_mesh (DeviceMesh) – Device mesh for distributed operations.

Example

```python
from nemo_automodel.components.distributed.config import MegatronFSDPConfig

config = MegatronFSDPConfig(zero_dp_strategy=3, overlap_grad_reduce=True)

# device_mesh created externally via create_device_mesh()
manager = MegatronFSDPManager(config, device_mesh=device_mesh)
model, optimizer = manager.parallelize(model, optimizer)
```

Initialization

parallelize(model, optimizer=None)#

Parallelizes the given model using MegatronFSDP and TP sharding strategies.

Parameters:
  • model – The model to be parallelized.

  • optimizer – The optimizer for the model. If None, the caller must drive gradient synchronization manually: call model.finish_grad_sync() before optimizer.step(), and call model.install_optimized_model_weights() and model.zero_grad_buffer() after optimizer.zero_grad().

Returns:

(parallelized_model, optimizer)

Return type:

tuple
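When parallelize() is called without an optimizer, the manual synchronization hooks described above must be invoked around each optimizer step. The following is a minimal sketch of that call order; model, optimizer, and batch are assumed to already exist (the loss computation is a placeholder), and only the hooks named in the parameter description are used:

```python
# Sketch: manual sync protocol when parallelize() was given optimizer=None.
# `model`, `optimizer`, and `batch` are assumed; only call order is illustrated.
loss = model(**batch).loss
loss.backward()

model.finish_grad_sync()                 # synchronize gradients before the step
optimizer.step()

optimizer.zero_grad()
model.install_optimized_model_weights()  # called after optimizer.zero_grad()
model.zero_grad_buffer()                 # called after optimizer.zero_grad()
```

Passing the optimizer to parallelize() avoids this bookkeeping, since the returned optimizer performs these steps internally.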

nemo_automodel.components.distributed.megatron_fsdp.fully_shard_optimizer(
model: torch.nn.Module,
optimizer: torch.optim.Optimizer,
preproc_state_dict_for_dcp_ckpt: bool = True,
) → torch.optim.Optimizer#
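A minimal call sketch for this function, based only on the signature above; model and optimizer are assumed to already be constructed (and the model already parallelized):

```python
# Sketch: shard the optimizer state across data-parallel ranks.
# preproc_state_dict_for_dcp_ckpt=True (the default) keeps the state dict
# preprocessed for DCP (torch.distributed.checkpoint) checkpointing.
optimizer = fully_shard_optimizer(
    model,
    optimizer,
    preproc_state_dict_for_dcp_ckpt=True,
)
```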