NVIDIA Modulus Core v0.2.1

Modulus Distributed

class modulus.distributed.manager.DistributedManager[source]

Bases: object

Distributed Manager for setting up the distributed training environment.

This is a singleton that creates a persistent class instance for storing parallel environment information throughout the lifetime of the program. It should be used to help set up Distributed Data Parallel and parallel datapipes.

Note

One should call DistributedManager.initialize() prior to constructing a manager object.

Example

>>> DistributedManager.initialize()
>>> manager = DistributedManager()
>>> manager.rank
0
>>> manager.world_size
1

property broadcast_buffers

broadcast_buffers in PyTorch DDP

static cleanup()[source]

Clean up distributed group and singleton

property cuda

If cuda is available

property device

Process device

property distributed

Distributed environment

property find_unused_parameters

find_unused_parameters in PyTorch DDP
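
The broadcast_buffers and find_unused_parameters properties are intended to be forwarded to PyTorch's DistributedDataParallel wrapper. The following is a minimal sketch, not an official recipe, using a torch.nn.Linear stand-in for a real model; the exact wrapping code may differ in your application:

import torch
from torch.nn.parallel import DistributedDataParallel
from modulus.distributed.manager import DistributedManager

DistributedManager.initialize()
manager = DistributedManager()

# Stand-in model; replace with your own torch.nn.Module
model = torch.nn.Linear(8, 8).to(manager.device)

if manager.distributed:
    model = DistributedDataParallel(
        model,
        # device_ids/output_device must be None for CPU-only runs
        device_ids=[manager.local_rank] if manager.cuda else None,
        output_device=manager.device if manager.cuda else None,
        broadcast_buffers=manager.broadcast_buffers,
        find_unused_parameters=manager.find_unused_parameters,
    )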

static get_available_backend()[source]

Get communication backend

group(name=None)[source]

Returns the process group with the given name. If name is None, None is returned, indicating the default process group. If the named group does not exist, None is also returned.

group_name(group=None)[source]

Returns the name of the process group

property group_names

Returns a list of all named process groups created

group_rank(name=None)[source]

Returns the rank in the named process group

group_size(name=None)[source]

Returns the size of the named process group
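
A rough sketch of querying process-group information through the manager. Only the default (global) group is inspected here; any named groups are assumed to have been created elsewhere in the program, and the rank/size queries are guarded so the snippet also runs in a non-distributed, single-process setting:

from modulus.distributed.manager import DistributedManager

DistributedManager.initialize()
manager = DistributedManager()

print(manager.group_names)  # names of any process groups created so far
print(manager.group())      # None -> the default (global) process group

if manager.distributed:
    # Rank/size of the default group; pass a name to query a named group
    # created elsewhere in the program.
    print(manager.group_rank())
    print(manager.group_size())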

static initialize()[source]

Initialize distributed manager

static initialize_env()[source]

Setup method using generic initialization

static initialize_open_mpi(addr, port)[source]

Setup method using OpenMPI initialization

static initialize_slurm(port)[source]

Setup method using SLURM initialization
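
As a minimal sketch of the environment-variable path, assuming initialize() falls back to the standard PyTorch variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK); launchers such as torchrun or SLURM normally export these, or their equivalents, for you:

import os
from modulus.distributed.manager import DistributedManager

# Values for a single-process run; a launcher would normally set these.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "12355")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("LOCAL_RANK", "0")

DistributedManager.initialize()  # selects env/SLURM/OpenMPI setup automatically
manager = DistributedManager()
print(manager.rank, manager.world_size, manager.device)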

classmethod is_initialized() → bool[source]

If manager singleton has been initialized

property local_rank

Process rank on local machine

property rank

Process rank

static setup(rank=0, world_size=1, local_rank=None, addr='localhost', port='12355', backend='nccl', method='env')[source]

Set up PyTorch distributed process group and update manager attributes
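
A sketch of calling setup() directly from manually spawned workers, using only the documented signature; most applications should instead rely on initialize() together with a launcher. The gloo backend is chosen here so the example does not require GPUs:

import torch.multiprocessing as mp
from modulus.distributed.manager import DistributedManager

def worker(rank: int, world_size: int):
    # Configure the process group and manager attributes for this worker
    DistributedManager.setup(
        rank=rank,
        world_size=world_size,
        addr="localhost",
        port="12355",
        backend="gloo",  # use "nccl" for multi-GPU training
        method="env",
    )
    manager = DistributedManager()
    print(f"rank {manager.rank} / {manager.world_size} on {manager.device}")
    DistributedManager.cleanup()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)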

property world_size

Number of processes in the distributed environment

modulus.distributed.utils.gather_loss(loss: float, dst_rank: int = 0, mean: bool = True)[source]

Gathers loss from all processes to a single rank for logging (see the usage sketch below)

Parameters
  • loss (float) – loss value

  • dst_rank (int, optional) – destination rank to gather to, by default 0

  • mean (bool, optional) – Calculate the mean of the losses gathered, by default True

Raises

Exception – If DistributedManager has yet to be initialized
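
A usage sketch, meant to be run under a distributed launch after the manager has been initialized. The loss value here is a hypothetical stand-in for the output of a real training step:

from modulus.distributed.manager import DistributedManager
from modulus.distributed.utils import gather_loss

DistributedManager.initialize()
manager = DistributedManager()

loss = 0.123  # stand-in for loss.item() from a training step
mean_loss = gather_loss(loss, dst_rank=0, mean=True)
if manager.rank == 0:
    print(f"mean loss over {manager.world_size} ranks: {mean_loss}")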
