nemo_automodel.components.distributed.megatron_fsdp#
Module Contents#
Classes#
MegatronFSDPManager: Manager for parallelizing models using MegatronFSDP with TP, DP, CP sharding.
Functions#
Data#
API#
- nemo_automodel.components.distributed.megatron_fsdp.logger#
`getLogger(...)`
- class nemo_automodel.components.distributed.megatron_fsdp.MegatronFSDPManager(
- config: nemo_automodel.components.distributed.config.MegatronFSDPConfig,
- device_mesh: torch.distributed.device_mesh.DeviceMesh,
- )#
Manager for parallelizing models using MegatronFSDP with TP, DP, CP sharding.
This manager applies parallelization to the model using a prescribed TP sharding plan. It supports mixed precision and various FSDP options.
The device mesh must be created externally and passed in.
- Parameters:
config (MegatronFSDPConfig) – Configuration for MegatronFSDP distributed training.
device_mesh (DeviceMesh) – Device mesh for distributed operations.
Example:

    from nemo_automodel.components.distributed.config import MegatronFSDPConfig

    config = MegatronFSDPConfig(zero_dp_strategy=3, overlap_grad_reduce=True)
    # device_mesh created externally via create_device_mesh()
    manager = MegatronFSDPManager(config, device_mesh=device_mesh)
    model, optimizer = manager.parallelize(model, optimizer)
Initialization
- parallelize(model, optimizer=None)#
Parallelizes the given model using MegatronFSDP and TP sharding strategies.
- Parameters:
model – The model to be parallelized.
optimizer – The optimizer for the model. If None, the user must call model.finish_grad_sync() before optimizer.step(), and call model.install_optimized_model_weights() and model.zero_grad_buffer() after optimizer.zero_grad().
- Returns:
(parallelized_model, optimizer)
- Return type:
tuple
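The manual synchronization protocol required when parallelize() is called with optimizer=None can be sketched as below. The stub classes are hypothetical stand-ins used only to illustrate the call ordering; the real MegatronFSDP-wrapped model exposes finish_grad_sync(), install_optimized_model_weights(), and zero_grad_buffer() as described in the docstring above.

```python
# Minimal sketch of the manual grad-sync ordering when optimizer=None.
# _StubModel and _StubOptimizer are hypothetical stand-ins that merely
# record the order of calls; they are not part of nemo_automodel.
class _StubModel:
    def __init__(self):
        self.calls = []

    def finish_grad_sync(self):
        self.calls.append("finish_grad_sync")

    def install_optimized_model_weights(self):
        self.calls.append("install_optimized_model_weights")

    def zero_grad_buffer(self):
        self.calls.append("zero_grad_buffer")


class _StubOptimizer:
    def __init__(self, model):
        self.model = model

    def step(self):
        self.model.calls.append("optimizer.step")

    def zero_grad(self):
        self.model.calls.append("optimizer.zero_grad")


def training_step(model, optimizer):
    # ... forward and backward passes would run here ...
    model.finish_grad_sync()                 # required before optimizer.step()
    optimizer.step()
    optimizer.zero_grad()
    model.install_optimized_model_weights()  # required after optimizer.zero_grad()
    model.zero_grad_buffer()                 # required after optimizer.zero_grad()


model = _StubModel()
optimizer = _StubOptimizer(model)
training_step(model, optimizer)
```

When an optimizer is passed to parallelize(), these calls are handled for you; the sketch only matters for the optimizer=None path.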
- nemo_automodel.components.distributed.megatron_fsdp.fully_shard_optimizer(
- model: torch.nn.Module,
- optimizer: torch.optim.Optimizer,
- preproc_state_dict_for_dcp_ckpt: bool = True,