nemo_automodel.components.distributed.ddp#
Module Contents#
Classes#
- DDPManager – Manages setting up distributed training using PyTorch’s DDP.
Data#
API#
- nemo_automodel.components.distributed.ddp.logger#
‘getLogger(…)’
- class nemo_automodel.components.distributed.ddp.DDPManager#
Manages setting up distributed training using PyTorch’s DDP.
Attributes:
- backend – The distributed backend to use (e.g. “nccl” or “gloo”). Defaults to “nccl”.
  - Type: str
- rank – Global rank of this process. Set during distributed setup.
  - Type: int
- world_size – Total number of processes in the distributed group. Set during distributed setup.
  - Type: int
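For orientation, a minimal usage sketch follows. The constructor keywords and defaults are assumptions inferred from the dataclass fields documented below, and the script is assumed to be launched with a distributed launcher such as torchrun.

```python
import torch.nn as nn

from nemo_automodel.components.distributed.ddp import DDPManager

# Assumed keyword argument, based on the `backend` field documented below.
# __post_init__ sets up torch.distributed, so construction alone initializes
# the process group (run under `torchrun --nproc_per_node=N ...`).
manager = DDPManager(backend="nccl")

model = nn.Linear(16, 16)
ddp_model = manager.parallelize(model)  # moved to the manager's device and wrapped in DDP

print(manager.rank, manager.world_size)
```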
- backend: str#
‘field(…)’
- world_size: int#
‘field(…)’
- rank: int#
‘field(…)’
- activation_checkpointing: bool#
‘field(…)’
- __post_init__()#
Post-initialization hook that sets up the distributed environment.
- _setup_distributed()#
Initialize the torch.distributed process group and set up device configuration.
The method sets the rank and world_size of the DDPManager, configures the device (GPU for the ‘nccl’ backend, CPU otherwise), and initializes the process group.
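The behavior described here resembles the following sketch. This is an illustration rather than the module’s actual code; the helper name and the environment-variable handling are assumptions (torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK for each worker).

```python
import os

import torch
import torch.distributed as dist


def setup_distributed_sketch(backend: str = "nccl"):
    """Illustrative sketch of the setup described above; names are assumptions."""
    # Distributed launchers such as torchrun export these for every worker.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    if backend == "nccl":
        # One GPU per process: select the device from the local rank.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        device = torch.device("cuda", local_rank)
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    if not dist.is_initialized():
        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    return rank, world_size, device
```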
- parallelize(model)#
Wraps the given model with DistributedDataParallel (DDP).
Moves the model to the initialized device before wrapping. For CUDA devices, the device id is passed to DDP as device_ids; for CPU, no device ids are provided (see the sketch after this entry).
- Parameters:
model (torch.nn.Module) – The PyTorch model to be wrapped.
- Returns:
The DDP-wrapped model.
- Return type:
torch.nn.parallel.DistributedDataParallel
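The wrapping logic described for parallelize might look roughly like the sketch below. It is an illustration under the stated behavior, not the module’s actual source; the function name is hypothetical.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def parallelize_sketch(model: torch.nn.Module, device: torch.device) -> DDP:
    """Illustrative sketch of the wrapping behavior described above."""
    # Move parameters and buffers to the target device before wrapping.
    model = model.to(device)
    if device.type == "cuda":
        # Pass the CUDA device id so DDP binds gradients/buckets to the right GPU.
        return DDP(model, device_ids=[device.index])
    # CPU case (e.g. the "gloo" backend): no device ids are given.
    return DDP(model)
```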