nemo_automodel.distributed.ddp#
Module Contents#
Classes#
| DDPManager | Manages setting up distributed training using PyTorch’s DDP. |
API#
- class nemo_automodel.distributed.ddp.DDPManager[source]#
Manages setting up distributed training using PyTorch’s DDP.
- backend – The distributed backend to use (e.g. “nccl” or “gloo”). Defaults to “nccl”.
- Type:
str
- rank – Global rank of this process. This is set during distributed setup.
- Type:
int
- world_size – Total number of processes in the distributed group. Set at distributed setup.
- Type:
int
- backend: str#
'field(…)'
- world_size: int#
'field(…)'
- rank: int#
'field(…)'
- setup_distributed()[source]#
Initialize the torch.distributed process group and set up device configuration.
This method requires the following environment variables to be set:
- RANK: Global rank of the process.
- WORLD_SIZE: Total number of processes.
- MASTER_ADDR: Address of the master node.
- MASTER_PORT: Port on which the master node is listening.
The method sets the rank and world_size of the DDPManager, configures the device (GPU for ‘nccl’ backend, CPU otherwise), and initializes the process group.
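A minimal usage sketch (not part of the rendered API) is shown below. It assumes the script is launched with torchrun, which exports the required environment variables for each worker, and that DDPManager can be constructed with the backend keyword suggested by the dataclass fields above.
```python
# Minimal sketch, assuming DDPManager accepts the `backend` keyword shown in
# the fields above. Launch with e.g.:
#   torchrun --nproc_per_node=2 train.py
# torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT per worker.
import torch
from nemo_automodel.distributed.ddp import DDPManager

backend = "nccl" if torch.cuda.is_available() else "gloo"
manager = DDPManager(backend=backend)
manager.setup_distributed()  # reads the env vars and initializes the process group

print(f"rank {manager.rank} / world_size {manager.world_size} ready")
```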
- wrap_model(model)[source]#
Wraps the given model with DistributedDataParallel (DDP).
Moves the model to the initialized device before wrapping. For CUDA devices, the device id is passed to DDP as device_ids; for CPU, no device ids are provided.
- Parameters:
model (torch.nn.Module) – The PyTorch model to be wrapped.
- Returns:
The DDP-wrapped model.
- Return type:
torch.nn.parallel.DistributedDataParallel
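A short end-to-end sketch follows, under the same torchrun launch assumption as above; the nn.Linear model is an illustrative placeholder, not part of this module.
```python
# Minimal sketch of wrapping a model with DDPManager.wrap_model; run under
# torchrun so the required environment variables are present.
import torch
import torch.nn as nn
from nemo_automodel.distributed.ddp import DDPManager

manager = DDPManager(backend="nccl" if torch.cuda.is_available() else "gloo")
manager.setup_distributed()

model = nn.Linear(16, 4)               # placeholder torch.nn.Module
ddp_model = manager.wrap_model(model)  # moved to the local device, then wrapped in DDP

# The wrapped model is used like the original module inside the training loop;
# gradients are synchronized across ranks on backward().
device = next(ddp_model.parameters()).device
out = ddp_model(torch.randn(8, 16, device=device))
out.sum().backward()
```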