NVIDIA Modulus Core v0.1.0
Date: 2023-08-08
class modulus.distributed.manager.DistributedManager[source]

Bases: object

Distributed manager for setting up a distributed training environment.

This is a singleton that creates a persistent class instance for storing parallel environment information throughout the lifetime of the program. Use it to help set up Distributed Data Parallel and parallel datapipes.

Note

Call DistributedManager.initialize() before constructing a manager object.

Example

>>> DistributedManager.initialize()
>>> manager = DistributedManager()
>>> manager.rank
0
>>> manager.world_size
1
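
The singleton behaviour described above can be illustrated with a minimal shared-state sketch in plain Python. This is not the Modulus implementation: the class name SharedStateManager and its internals are hypothetical, and only the method names initialize, is_initialized and cleanup mirror the documented interface.

```python
class SharedStateManager:
    """Hypothetical stand-in: every instance shares one attribute dict."""

    _shared_state = {}

    def __new__(cls):
        obj = super().__new__(cls)
        # All instances point at the same dict, so state stored once
        # is visible from every object constructed later.
        obj.__dict__ = cls._shared_state
        return obj

    @classmethod
    def initialize(cls, rank=0, world_size=1):
        # Assumed signature for illustration; the documented initialize()
        # takes no arguments and discovers these values itself.
        cls._shared_state.update(rank=rank, world_size=world_size, initialized=True)

    @classmethod
    def is_initialized(cls):
        return cls._shared_state.get("initialized", False)

    @classmethod
    def cleanup(cls):
        # Clear in place so existing instances also lose the state.
        cls._shared_state.clear()


SharedStateManager.initialize(rank=0, world_size=1)
first = SharedStateManager()
second = SharedStateManager()
```

Because both objects share state, constructing a manager anywhere in the program yields the same rank and world size without passing a handle around.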

property broadcast_buffers

Value for the broadcast_buffers argument of PyTorch DDP
static cleanup()[source]

Clean up distributed group and singleton
property cuda

Whether CUDA is available
property device

Process device
property distributed

Distributed environment
property find_unused_parameters

Value for the find_unused_parameters argument of PyTorch DDP
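
The broadcast_buffers and find_unused_parameters properties, together with device and local_rank, map onto keyword arguments of torch.nn.parallel.DistributedDataParallel. The sketch below shows one way to collect them. The helper ddp_kwargs and the stand-in manager object are hypothetical, and the values are made up; PyTorch is not imported so that the snippet runs on its own.

```python
from types import SimpleNamespace


def ddp_kwargs(manager):
    """Collect DistributedDataParallel keyword arguments from a manager-like object."""
    return {
        "device_ids": [manager.local_rank],
        "output_device": manager.device,
        "broadcast_buffers": manager.broadcast_buffers,
        "find_unused_parameters": manager.find_unused_parameters,
    }


# Stand-in exposing the documented property names; the values are illustrative.
fake_manager = SimpleNamespace(
    local_rank=0,
    device="cuda:0",
    broadcast_buffers=False,
    find_unused_parameters=False,
)
kwargs = ddp_kwargs(fake_manager)

# With PyTorch installed and a real DistributedManager, the dict could be used as:
#   model = torch.nn.parallel.DistributedDataParallel(model, **kwargs)
```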
static get_available_backend()[source]

Get communication backend
group(name=None)[source]

Returns the process group with the given name. If name is None, the returned group is also None, indicating the default process group. If the named group does not exist, None is returned as well.
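
The lookup rule for group() can be sketched with a plain dictionary. The function lookup_group and the registry contents are hypothetical and only mimic the documented return values; real entries would be PyTorch process group handles.

```python
from typing import Dict, Optional


def lookup_group(groups: Dict[str, object], name: Optional[str] = None) -> Optional[object]:
    """Mimic the documented rule: None for the default group or an unknown name."""
    if name is None:
        # None stands for the default process group.
        return None
    # Unknown names also yield None rather than raising.
    return groups.get(name)


# Illustrative registry with a placeholder handle.
registry = {"model_parallel": "handle-0"}
```

Since None is returned both for the default group and for a missing name, a caller that needs to tell the two apart would have to check group_names first.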
group_name(group=None)[source]

Returns the name of process group
property group_names

Returns a list of all named process groups created
group_rank(name=None)[source]

Returns the rank in named process group
group_size(name=None)[source]

Returns the size of named process group
static initialize()[source]

Initialize distributed manager
static initialize_env()[source]

Setup method using generic initialization
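
A rough sketch of what a generic, environment-based setup might read is shown below. It assumes the torchrun-style variables RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT; the source does not list which variables initialize_env() uses, so treat the names as an assumption. LaunchInfo and read_launch_env are hypothetical helpers, and the fallbacks reuse the addr and port defaults documented for setup().

```python
import os
from typing import Mapping, NamedTuple, Optional


class LaunchInfo(NamedTuple):
    rank: int
    world_size: int
    local_rank: Optional[int]
    addr: str
    port: str


def read_launch_env(env: Mapping[str, str] = os.environ) -> LaunchInfo:
    """Parse launcher variables (assumed names) into typed fields."""
    local_rank = env.get("LOCAL_RANK")
    return LaunchInfo(
        rank=int(env["RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        # LOCAL_RANK may be absent; leave it as None in that case.
        local_rank=int(local_rank) if local_rank is not None else None,
        addr=env.get("MASTER_ADDR", "localhost"),
        port=env.get("MASTER_PORT", "12355"),
    )


info = read_launch_env({"RANK": "1", "WORLD_SIZE": "4", "LOCAL_RANK": "1"})
```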
static initialize_open_mpi(addr, port)[source]

Setup method using OpenMPI initialization
static initialize_slurm(port)[source]

Setup method using SLURM initialization
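
The three setup methods correspond to different launchers, each of which exports its own environment variables. The sketch below guesses the launcher from those variables. The function detect_launcher, the order of the checks and the "single" fallback are assumptions for illustration and are not taken from the Modulus source; the variable names are those set by torchrun, SLURM and Open MPI.

```python
from typing import Mapping


def detect_launcher(env: Mapping[str, str]) -> str:
    """Guess the launcher from the variables it exports (assumed check order)."""
    if "RANK" in env and "WORLD_SIZE" in env:
        return "env"      # torchrun-style generic initialization
    if "SLURM_PROCID" in env:
        return "slurm"    # process started by SLURM's srun
    if "OMPI_COMM_WORLD_RANK" in env:
        return "openmpi"  # process started by Open MPI's mpirun
    return "single"       # no launcher detected: single-process run
```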
classmethod is_initialized() → bool[source]

Whether the manager singleton has been initialized
property local_rank

Process rank on local machine
property rank

Process rank
static setup(rank=0, world_size=1, local_rank=None, addr='localhost', port='12355', backend='nccl', method='env')[source]

Set up PyTorch distributed process group and update manager attributes
property world_size

Number of processes in the distributed environment
modulus.distributed.utils.gather_loss(loss: float, dst_rank: int = 0, mean: bool = True)[source]

Gathers loss from all processes to one for logging

Parameters

  • loss (float) – loss value

  • dst_rank (int, optional) – destination rank to gather to, by default 0

  • mean (bool, optional) – Calculate the mean of the losses gathered, by default True
Raises

Exception – If DistributedManager has yet to be initialized
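
The arithmetic behind gather_loss can be simulated without any distributed backend. The sketch below is a single-process model, not the library code: simulate_gather_loss is a hypothetical name, and two behaviours are assumptions not stated in the documentation, namely that ranks other than dst_rank receive None and that mean=False returns the sum of the gathered losses.

```python
from typing import Optional, Sequence


def simulate_gather_loss(
    per_rank_losses: Sequence[float],
    rank: int,
    dst_rank: int = 0,
    mean: bool = True,
) -> Optional[float]:
    """Model what one rank would see after the gather (assumed semantics)."""
    if rank != dst_rank:
        # Assumption: only the destination rank gets a value.
        return None
    total = sum(per_rank_losses)
    # Assumption: with mean=False the gathered losses are summed.
    return total / len(per_rank_losses) if mean else total


# One loss value per process in a four-process run (illustrative numbers).
losses = [0.5, 1.5, 1.0, 1.0]
logged = simulate_gather_loss(losses, rank=0)
```

In a training loop, only the process whose rank equals dst_rank would then write the returned value to the log.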
