Modulus Distributed
- class modulus.distributed.manager.DistributedManager[source]
Bases: object
Distributed manager for setting up the distributed training environment.
This is a singleton that creates a persistent class instance for storing parallel environment information throughout the lifetime of the program. It should be used to help set up Distributed Data Parallel (DDP) and parallel datapipes.
Note: One should call DistributedManager.initialize() prior to constructing a manager object.
Example
>>> DistributedManager.initialize()
>>> manager = DistributedManager()
>>> manager.rank
0
>>> manager.world_size
1
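Beyond the minimal example above, a common pattern is to let the manager's properties drive a standard PyTorch DistributedDataParallel wrapper. The snippet below is a minimal sketch, not part of the official API reference; the toy model and layer size are placeholders.

    import torch
    from torch.nn.parallel import DistributedDataParallel
    from modulus.distributed.manager import DistributedManager

    DistributedManager.initialize()   # read rank/world size from the environment
    manager = DistributedManager()    # singleton, cheap to construct anywhere

    # Placeholder model, moved to this process's device
    model = torch.nn.Linear(32, 32).to(manager.device)
    if manager.distributed:
        # Forward the manager's DDP-related properties to the PyTorch wrapper
        model = DistributedDataParallel(
            model,
            device_ids=[manager.local_rank] if manager.cuda else None,
            output_device=manager.device if manager.cuda else None,
            broadcast_buffers=manager.broadcast_buffers,
            find_unused_parameters=manager.find_unused_parameters,
        )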
- property broadcast_buffers
broadcast_buffers in PyTorch DDP
- static cleanup()[source]
Clean up distributed group and singleton
- property cuda
If CUDA is available
- property device
Process device
- property distributed
Distributed environment
- property find_unused_parameters
find_unused_parameters in PyTorch DDP
- static get_available_backend()[source]
Get communication backend
- group(name=None)[source]
Returns the process group with the given name. If name is None, the returned group is also None, indicating the default process group. If the named group does not exist, None is returned as well. See the usage sketch after this listing.
- group_name(group=None)[source]
Returns the name of the given process group
- property group_names
Returns a list of all named process groups created
- group_rank(name=None)[source]
Returns this process's rank in the named process group
- group_size(name=None)[source]
Returns the size of the named process group
- static initialize()[source]
Initialize distributed manager
- static initialize_env()[source]
Setup method using generic initialization
- static initialize_open_mpi(addr, port)[source]
Setup method using OpenMPI initialization
- static initialize_slurm(port)[source]
Setup method using SLURM initialization
- classmethod is_initialized() → bool[source]
Whether the manager singleton has been initialized
- property local_rank
Process rank on local machine
- property rank
Process rank
- static setup(rank=0, world_size=1, local_rank=None, addr='localhost', port='12355', backend='nccl', method='env')[source]
Set up PyTorch distributed process group and update manager attributes
- property world_size
Number of processes in the distributed environment
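As referenced in the group() entry above, the named process-group accessors can be combined to query parallel groups at runtime. Below is a minimal sketch, assuming the manager has already been initialized; the group name "model_parallel" is purely hypothetical and is not created anywhere in this snippet.

    from modulus.distributed.manager import DistributedManager

    manager = DistributedManager()

    # All named process groups registered so far (may be empty)
    print(manager.group_names)

    # With no name, group() returns None, which denotes the default (global) group
    assert manager.group() is None

    # Query a hypothetical named group; group() also returns None if it was never created
    name = "model_parallel"  # placeholder name, not created in this sketch
    pg = manager.group(name)
    if pg is not None:
        print(name, manager.group_rank(name), manager.group_size(name))
        assert manager.group_name(pg) == name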
- modulus.distributed.utils.gather_loss(loss: float, dst_rank: int = 0, mean: bool = True)[source]
Gathers the loss from all processes onto one rank for logging
- Parameters
loss (float) – loss value
dst_rank (int, optional) – destination rank to gather to, by default 0
mean (bool, optional) – Calculate the mean of the losses gathered, by default True
- Raises
Exception – If DistributedManager has yet to be initialized
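A minimal usage sketch for gather_loss, assuming DistributedManager.initialize() has already been called; the per-process loss value below is made up, and the return value is assumed to be the reduced loss on the destination rank.

    from modulus.distributed.manager import DistributedManager
    from modulus.distributed.utils import gather_loss

    DistributedManager.initialize()   # required first, otherwise gather_loss raises
    manager = DistributedManager()

    local_loss = 0.25 * (manager.rank + 1)   # placeholder loss from a training step

    # Gather and average the losses on rank 0 for logging
    # (assumption: the mean is returned on dst_rank)
    global_loss = gather_loss(float(local_loss), dst_rank=0, mean=True)
    if manager.rank == 0:
        print(f"mean loss across {manager.world_size} ranks: {global_loss}")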