nemo_automodel.components.distributed.ddp#

Module Contents#

Classes#

DDPManager

Manager for distributed training using PyTorch’s DDP.

Data#

API#

nemo_automodel.components.distributed.ddp.logger#

getLogger(...)

class nemo_automodel.components.distributed.ddp.DDPManager(
config: nemo_automodel.components.distributed.config.DDPConfig,
)#

Manager for distributed training using PyTorch’s DDP.

This manager wraps models with DistributedDataParallel for data-parallel distributed training.

Parameters:

config (DDPConfig) – Configuration for DDP distributed training.

Example:

    from nemo_automodel.components.distributed.config import DDPConfig

    config = DDPConfig(activation_checkpointing=True)
    manager = DDPManager(config)
    model = manager.parallelize(model)

Initialization

_setup_distributed()#

Initialize device configuration for DDP.

Sets the rank, world_size, and device based on the backend.

parallelize(model)#

Wraps the given model with DistributedDataParallel (DDP).

Moves the model to the initialized device before wrapping. For CUDA devices, the device id is passed to DDP as device_ids; for CPU, no device ids are provided.

Parameters:

model (torch.nn.Module) – The PyTorch model to be wrapped.

Returns:

The DDP-wrapped model.

Return type:

torch.nn.parallel.DistributedDataParallel
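The CUDA-vs-CPU handling described above can be sketched as follows. This is a simplified stand-alone illustration of the documented behavior (move the model to the device, pass device_ids only for CUDA), not the library's actual code:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def parallelize(model: torch.nn.Module, device: torch.device) -> DDP:
    """Hypothetical sketch of the wrapping logic described above."""
    # Move parameters/buffers to the target device before wrapping.
    model = model.to(device)
    if device.type == "cuda":
        # CUDA: tell DDP which single device this rank owns.
        return DDP(model, device_ids=[device.index])
    # CPU (e.g. gloo backend): DDP is constructed without device_ids.
    return DDP(model)
```

With a single-process gloo group, `parallelize(torch.nn.Linear(4, 2), torch.device("cpu"))` returns a DDP module whose forward pass behaves like the original model's.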