nemo_automodel.components.distributed.ddp#
Module Contents#
Classes#
- DDPManager – Manages setting up distributed training using PyTorch’s DDP.
Data#
API#
- nemo_automodel.components.distributed.ddp.logger#
‘getLogger(…)’
- class nemo_automodel.components.distributed.ddp.DDPManager#
Manages setting up distributed training using PyTorch’s DDP.
Attributes:
- backend – The distributed backend to use (e.g. “nccl” or “gloo”). Defaults to “nccl”.
  - Type: str
- rank – Global rank of this process. Set during distributed setup.
  - Type: int
- world_size – Total number of processes in the distributed group. Set during distributed setup.
  - Type: int
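For orientation, a minimal usage sketch follows. The constructor keywords and defaults are assumptions inferred from the dataclass fields documented below, and the script is assumed to be launched with a distributed launcher such as torchrun.

```python
import torch.nn as nn

from nemo_automodel.components.distributed.ddp import DDPManager

# Assumed keyword argument, based on the `backend` field documented below.
# __post_init__ sets up torch.distributed, so construction alone initializes
# the process group (run under `torchrun --nproc_per_node=N ...`).
manager = DDPManager(backend="nccl")

model = nn.Linear(16, 16)
ddp_model = manager.parallelize(model)  # moved to the manager's device and wrapped in DDP

print(manager.rank, manager.world_size)
```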
- backend: str#
‘field(…)’
- world_size: int#
‘field(…)’
- rank: int#
‘field(…)’
- activation_checkpointing: bool#
‘field(…)’
- __post_init__()#
Post-initialization hook that sets up the distributed environment.
- _setup_distributed()#
Initialize the torch.distributed process group and set up device configuration.
The method sets the rank and world_size of the DDPManager, configures the device (GPU for the ‘nccl’ backend, CPU otherwise), and initializes the process group.
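The behavior described here resembles the following sketch. This is an illustration rather than the module’s actual code; the helper name and the environment-variable handling are assumptions (torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK for each worker).

```python
import os

import torch
import torch.distributed as dist


def setup_distributed_sketch(backend: str = "nccl"):
    """Illustrative sketch of the setup described above; names are assumptions."""
    # Distributed launchers such as torchrun export these for every worker.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    if backend == "nccl":
        # One GPU per process: select the device from the local rank.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        device = torch.device("cuda", local_rank)
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    if not dist.is_initialized():
        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    return rank, world_size, device
```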
- parallelize(model)#
Wraps the given model with DistributedDataParallel (DDP).
Moves the model to the initialized device before wrapping. For CUDA devices, the device id is passed to DDP as device_ids; for CPU, no device ids are provided (see the sketch after this entry).
- Parameters:
model (torch.nn.Module) – The PyTorch model to be wrapped.
- Returns:
The DDP-wrapped model.
- Return type:
torch.nn.parallel.DistributedDataParallel
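The wrapping logic described for parallelize might look roughly like the sketch below. It is an illustration under the stated behavior, not the module’s actual source; the function name is hypothetical.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def parallelize_sketch(model: torch.nn.Module, device: torch.device) -> DDP:
    """Illustrative sketch of the wrapping behavior described above."""
    # Move parameters and buffers to the target device before wrapping.
    model = model.to(device)
    if device.type == "cuda":
        # Pass the CUDA device id so DDP binds gradients/buckets to the right GPU.
        return DDP(model, device_ids=[device.index])
    # CPU case (e.g. the "gloo" backend): no device ids are given.
    return DDP(model)
```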