nemo_automodel.components.distributed.ddp#
Module Contents#
Classes#
| DDPManager | Manager for distributed training using PyTorch’s DDP. |
Data#
API#
- nemo_automodel.components.distributed.ddp.logger#
`getLogger(…)`
- class nemo_automodel.components.distributed.ddp.DDPManager(config)#
Manager for distributed training using PyTorch’s DDP.
This manager wraps models with DistributedDataParallel for data-parallel distributed training.
- Parameters:
config (DDPConfig) – Configuration for DDP distributed training.
Example:

```python
from nemo_automodel.components.distributed.config import DDPConfig

config = DDPConfig(activation_checkpointing=True)
manager = DDPManager(config)
model = manager.parallelize(model)
```
Initialization
- _setup_distributed()#
Initialize device configuration for DDP.
Sets the rank, world_size, and device based on the backend.
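The actual setup is internal to the manager, but rank, world size, and device selection of this kind are conventionally derived from the environment variables that `torchrun` exports. A minimal sketch under that assumption (the function name and fallbacks are illustrative, not the actual `DDPManager` internals):

```python
# Hedged sketch of rank/world_size/device setup; assumes the standard
# torchrun environment variables (RANK, WORLD_SIZE, LOCAL_RANK).
# Not the actual DDPManager implementation.
import os

import torch


def setup_distributed():
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if torch.cuda.is_available():
        # Each local process pins itself to one GPU on its node.
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")
    return rank, world_size, device
```

When launched without `torchrun`, the fallbacks give a single-process configuration (rank 0, world size 1).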
- parallelize(model)#
Wraps the given model with DistributedDataParallel (DDP).
Moves the model to the initialized device before wrapping. For CUDA devices, the device id is passed to DDP as device_ids; for CPU, no device ids are provided.
- Parameters:
model (torch.nn.Module) – The PyTorch model to be wrapped.
- Returns:
The DDP-wrapped model.
- Return type:
torch.nn.parallel.DistributedDataParallel
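The CUDA-versus-CPU wrapping behavior described above can be sketched as follows; this is assumed behavior for illustration, not the library's actual code, and `wrap_with_ddp` is a hypothetical name:

```python
# Sketch of the device-dependent DDP wrapping described above; an
# assumption for illustration, not DDPManager's actual implementation.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_with_ddp(model: torch.nn.Module, device: torch.device) -> DDP:
    # Move the model to the initialized device before wrapping.
    model = model.to(device)
    if device.type == "cuda":
        # For CUDA devices, pass the device index as device_ids.
        return DDP(model, device_ids=[device.index])
    # For CPU (e.g. with the gloo backend), no device ids are provided.
    return DDP(model)
```

Note that `DistributedDataParallel` requires an initialized process group (`torch.distributed.init_process_group`) before any model can be wrapped.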