nemo_automodel.components.distributed.megatron_fsdp#

Module Contents#

Classes#

MegatronFSDPManager

Manager for parallelizing models using MegatronFSDP with TP, DP, CP sharding.

Functions#

Data#

API#

nemo_automodel.components.distributed.megatron_fsdp.logger#

'getLogger(...)'

class nemo_automodel.components.distributed.megatron_fsdp.MegatronFSDPManager(
config: nemo_automodel.components.distributed.config.MegatronFSDPConfig,
device_mesh: torch.distributed.device_mesh.DeviceMesh,
)#

Manager for parallelizing models using MegatronFSDP with TP, DP, CP sharding.

This manager applies parallelization to the model using a prescribed TP sharding plan. It supports mixed precision and various FSDP options.

The device mesh must be created externally and passed in.

Parameters:
  • config (MegatronFSDPConfig) – Configuration for MegatronFSDP distributed training.

  • device_mesh (DeviceMesh) – Device mesh for distributed operations.

Example

```python
from nemo_automodel.components.distributed.config import MegatronFSDPConfig

config = MegatronFSDPConfig(zero_dp_strategy=3, overlap_grad_reduce=True)

# device_mesh created externally via create_device_mesh()
manager = MegatronFSDPManager(config, device_mesh=device_mesh)
model, optimizer = manager.parallelize(model, optimizer)
```

Initialization

parallelize(model, optimizer=None)#

Parallelizes the given model using MegatronFSDP and TP sharding strategies.

Parameters:
  • model – The model to be parallelized.

  • optimizer – The optimizer for the model. If None, the caller must drive gradient synchronization manually: call model.finish_grad_sync() before optimizer.step(), and call model.install_optimized_model_weights() and model.zero_grad_buffer() after optimizer.zero_grad().

Returns:

(parallelized_model, optimizer)

Return type:

tuple
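When parallelize() is called without an optimizer, the manual synchronization hooks described above must be invoked around each optimizer step. The following is a minimal sketch of that call order; model, optimizer, and batch are assumed to already exist (the loss computation is a placeholder), and only the hooks named in the parameter description are used:

```python
# Sketch: manual sync protocol when parallelize() was given optimizer=None.
# `model`, `optimizer`, and `batch` are assumed; only call order is illustrated.
loss = model(**batch).loss
loss.backward()

model.finish_grad_sync()                 # synchronize gradients before the step
optimizer.step()

optimizer.zero_grad()
model.install_optimized_model_weights()  # called after optimizer.zero_grad()
model.zero_grad_buffer()                 # called after optimizer.zero_grad()
```

Passing the optimizer to parallelize() avoids this bookkeeping, since the returned optimizer performs these steps internally.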

nemo_automodel.components.distributed.megatron_fsdp.fully_shard_optimizer(
model: torch.nn.Module,
optimizer: torch.optim.Optimizer,
preproc_state_dict_for_dcp_ckpt: bool = True,
) → torch.optim.Optimizer#
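A minimal call sketch for this function, based only on the signature above; model and optimizer are assumed to already be constructed (and the model already parallelized):

```python
# Sketch: shard the optimizer state across data-parallel ranks.
# preproc_state_dict_for_dcp_ckpt=True (the default) keeps the state dict
# preprocessed for DCP (torch.distributed.checkpoint) checkpointing.
optimizer = fully_shard_optimizer(
    model,
    optimizer,
    preproc_state_dict_for_dcp_ckpt=True,
)
```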