Modulus Distributed

Distributed utilities in Modulus are designed to simplify the implementation of parallel training and inference scripts by providing a unified way to configure and query parameters associated with the distributed environment. The utilities in modulus.distributed build on top of the utilities from torch.distributed and abstract out some of the complexities of setting up a distributed execution environment.

The example below shows how to set up a simple distributed data parallel training recipe using the distributed utilities in Modulus. DistributedDataParallel in PyTorch provides the framework for data parallel training by reducing parameter gradients across multiple worker processes after the backward pass. The code below shows how to specify the device_ids, output_device, broadcast_buffers and find_unused_parameters arguments of the DistributedDataParallel utility using the DistributedManager.

import torch
from torch.nn.parallel import DistributedDataParallel
from modulus.distributed import DistributedManager
from modulus.models.mlp.fully_connected import FullyConnected

def main():
    # Initialize the DistributedManager. This will automatically
    # detect the number of processes the job was launched with and
    # set those configuration parameters appropriately. Currently
    # torchrun (or any other pytorch compatible launcher), mpirun (OpenMPI)
    # and SLURM based launchers are supported.
    DistributedManager.initialize()

    # Since this is a singleton class, you can just get an instance
    # of it anytime after initialization and not need to reinitialize
    # each time.
    dist = DistributedManager()

    # Set up model on the appropriate device. DistributedManager
    # figures out what device should be used on this process
    arch = FullyConnected(in_features=32, out_features=64).to(dist.device)

    # Set up DistributedDataParallel if using more than a single process.
    # The `distributed` property of DistributedManager can be used to
    # check this.
    if dist.distributed:
        ddps = torch.cuda.Stream()
        with torch.cuda.stream(ddps):
            arch = DistributedDataParallel(
                arch,
                device_ids=[dist.local_rank],  # Set the device_id to be
                                               # the local rank of this process on
                                               # this node
                output_device=dist.device,
                broadcast_buffers=dist.broadcast_buffers,
                find_unused_parameters=dist.find_unused_parameters,
            )
        torch.cuda.current_stream().wait_stream(ddps)

    # Set up the optimizer
    optimizer = torch.optim.Adam(
        arch.parameters(),
        lr=0.001,
    )

    def training_step(invar, target):
        # Clear gradients accumulated during the previous step
        optimizer.zero_grad()
        pred = arch(invar)
        loss = torch.sum(torch.pow(pred - target, 2))
        loss.backward()
        optimizer.step()
        return loss

    # Sample training loop
    for i in range(20):
        # Random inputs and targets for simplicity
        input = torch.randn(128, 32, device=dist.device)
        target = torch.randn(128, 64, device=dist.device)

        # Training step
        loss = training_step(input, target)

if __name__ == "__main__":
    main()

This training script can be run on a single GPU using python train.py or on multiple GPUs using

torchrun --standalone --nnodes=1 --nproc_per_node=<num_gpus> train.py

or

mpirun -np <num_gpus> python train.py

if using OpenMPI. The script can also be run on a SLURM cluster using

srun -n <num_gpus> python train.py

How does this work?

An important aspect of the DistributedManager is that it follows the Borg pattern. This means that DistributedManager essentially functions like a singleton class: once it is configured, all utilities in Modulus can access the same configuration and adapt to the specified distributed structure.

For example, see the constructor of the DistributedAFNO class:

    def __init__(
        self,
        inp_shape: Tuple[int, int],
        in_channels: int,
        out_channels: Union[int, Any] = None,
        patch_size: int = 16,
        embed_dim: int = 256,
        depth: int = 4,
        num_blocks: int = 4,
        channel_parallel_inputs: bool = False,
        channel_parallel_outputs: bool = False,
    ) -> None:
        super().__init__()

        out_channels = out_channels or in_channels

        if DistributedManager().group("model_parallel") is None:
            raise RuntimeError(
                "Distributed AFNO needs to have model parallel group created first. "
                "Check the MODEL_PARALLEL_SIZE environment variable"
            )

        comm_size = DistributedManager().group_size("model_parallel")
        if channel_parallel_inputs:
            if not (in_channels % comm_size == 0):
                raise ValueError(
                    "Error, in_channels needs to be divisible by model_parallel size"
                )

        self._impl = DistributedAFNONet(
            inp_shape=inp_shape,
            patch_size=(patch_size, patch_size),
            in_chans=in_channels,
            out_chans=out_channels,
            embed_dim=embed_dim,
            depth=depth,
            num_blocks=num_blocks,
            input_is_matmul_parallel=False,
            output_is_matmul_parallel=False,
        )

This model parallel implementation can simply instantiate DistributedManager and query whether the process group named "model_parallel" exists and, if so, what its size is. Similarly, other utilities can query what device to run on, the total size of the distributed run, etc. without having to explicitly pass those parameters down the call stack.
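
The shared-state behaviour described above can be sketched in plain Python. This is a minimal illustration of the Borg idea, not the DistributedManager implementation: the class name SharedConfig, its initialize arguments, and the helper function are all hypothetical.

```python
class SharedConfig:
    """Toy Borg-style class: every instance shares one state dictionary."""

    _shared_state = {}  # class-level dict shared by all instances

    def __init__(self):
        # Rebinding __dict__ makes attribute reads and writes on any
        # instance go to the same underlying dictionary.
        self.__dict__ = self._shared_state

    @classmethod
    def initialize(cls, rank=0, world_size=1):
        # Hypothetical one-time setup standing in for the real bootstrap.
        cls._shared_state.update(rank=rank, world_size=world_size)


def helper_deep_in_call_stack():
    # No configuration is passed in; a fresh instance sees the shared state.
    return SharedConfig().world_size


SharedConfig.initialize(rank=3, world_size=8)
a = SharedConfig()
b = SharedConfig()
print(a is b)                       # False: two distinct objects
print(a.rank, b.world_size)         # 3 8: but identical state
print(helper_deep_in_call_stack())  # 8
```

Unlike a strict singleton, each call returns a new object; only the state is shared, which is what lets any utility construct its own instance on demand.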

Note

This singleton/Borg pattern is very useful for the DistributedManager since it takes charge of bootstrapping the distributed run and unifies how all utilities become aware of the distributed configuration. However, the singleton/Borg pattern is not just a way to avoid passing parameters to utilities. Use of this pattern should be limited and well justified, to avoid losing traceability and to keep the code readable.

modulus.distributed.manager

class modulus.distributed.manager.DistributedManager[source]

Bases: object

Distributed Manager for setting up the distributed training environment.

This is a singleton that creates a persistent class instance for storing parallel environment information throughout the lifetime of the program. This should be used to help set up Distributed Data Parallel and parallel datapipes.

Note

One should call DistributedManager.initialize() prior to constructing a manager object.

Example

>>> DistributedManager.initialize()
>>> manager = DistributedManager()
>>> manager.rank
0
>>> manager.world_size
1

property broadcast_buffers

static cleanup()[source]

static create_orthogonal_process_group(name: str, group_name: str, verbose: bool = False)[source]

Create a process group that is orthogonal to the specified process group.

Parameters

static create_process_subgroup(name: str, size: int, group_name: Optional[str] = None, verbose: bool = False)[source]

Create a process subgroup of a parent process group. This must be a collective call by all processes participating in this application.

Parameters
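
To make the relationship between a subgroup and its orthogonal group concrete, the sketch below computes rank layouts without any communication. It assumes subgroups are formed from contiguous ranks and that an orthogonal group takes one rank from each subgroup; the function names are hypothetical and the layout used by the library may differ.

```python
def subgroup_ranks(world_size, size):
    """Split ranks 0..world_size-1 into contiguous subgroups of `size` ranks."""
    if world_size % size != 0:
        raise ValueError("subgroup size must divide the parent group size")
    return [list(range(start, start + size)) for start in range(0, world_size, size)]


def orthogonal_ranks(groups):
    """Build groups that take exactly one rank from each group in `groups`."""
    return [list(ranks) for ranks in zip(*groups)]


# Example: 8 processes, model-parallel groups of 2 ranks each.
model_parallel = subgroup_ranks(world_size=8, size=2)
data_parallel = orthogonal_ranks(model_parallel)
print(model_parallel)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(data_parallel)   # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Every model-parallel group shares exactly one rank with every orthogonal group, which is the usual way of combining model parallelism with data parallelism.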

property cuda

property device

property distributed

property find_unused_parameters

static get_available_backend()[source]

group(name=None)[source]

group_name(group=None)[source]

property group_names

group_rank(name=None)[source]

group_size(name=None)[source]

static initialize()[source]

Initialize distributed manager

Currently supported initialization methods are:

ENV: PyTorch environment variable initialization
SLURM: Initialization on SLURM systems.
OPENMPI: Initialization for OpenMPI launchers.

By default, initialization uses the first valid method in the order listed above. The initialization method can also be controlled explicitly by setting the MODULUS_DISTRIBUTED_INITIALIZATION_METHOD environment variable to one of the options above.
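
The selection order can be sketched as a small function over environment variables. This is an illustration only: the function name is hypothetical, and the marker variables used to decide whether a method is "valid" (RANK and WORLD_SIZE for torchrun, SLURM_PROCID and SLURM_NPROCS for SLURM, OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE for OpenMPI) are assumptions about what each launcher exports, not the checks the library performs.

```python
import os

# Assumed marker variables for each launcher; the real checks may differ.
MARKERS = {
    "ENV": ("RANK", "WORLD_SIZE"),
    "SLURM": ("SLURM_PROCID", "SLURM_NPROCS"),
    "OPENMPI": ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),
}
OVERRIDE = "MODULUS_DISTRIBUTED_INITIALIZATION_METHOD"


def pick_initialization_method(env=None):
    """Return the method name to use, or None if no launcher is detected."""
    env = os.environ if env is None else env
    forced = env.get(OVERRIDE)
    if forced is not None:
        if forced not in MARKERS:
            raise ValueError(f"unknown initialization method: {forced!r}")
        return forced
    # Dicts preserve insertion order, so this tries ENV, SLURM, OPENMPI.
    for method, variables in MARKERS.items():
        if all(name in env for name in variables):
            return method
    return None


print(pick_initialization_method({"RANK": "0", "WORLD_SIZE": "4"}))
print(pick_initialization_method({"SLURM_PROCID": "1", "SLURM_NPROCS": "2"}))
print(pick_initialization_method({"RANK": "0", "WORLD_SIZE": "4", OVERRIDE: "SLURM"}))
```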

static initialize_env()[source]

static initialize_open_mpi(addr, port)[source]

static initialize_slurm(port)[source]

classmethod is_initialized() → bool[source]

property local_rank

property rank

static setup(rank=0, world_size=1, local_rank=None, addr='localhost', port='12355', backend='nccl', method='env')[source]

property world_size

modulus.distributed.utils

modulus.distributed.utils.all_gather_v_wrapper(tensor: Tensor, sizes: List[int], dim: int = 0, group: Optional[ProcessGroup] = None) → Tensor[source]

Implements a distributed AllGatherV primitive. It is based on the idea of a single global tensor which is distributed along a specified dimension into chunks of variable size. This primitive gathers all local tensors from each rank into the full global tensor onto each rank.

Parameters
Returns
Return type
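
The semantics of AllGatherV can be illustrated with a single-process simulation in which Python lists stand in for the per-rank tensors. The function below is a hypothetical stand-in, not all_gather_v_wrapper itself, and it performs no communication.

```python
def simulate_all_gather_v(local_chunks):
    """Simulate AllGatherV: every rank ends up with the full concatenation.

    `local_chunks[r]` is the chunk held by rank r; chunk lengths may differ.
    """
    full = [value for chunk in local_chunks for value in chunk]
    # Each rank gets its own copy of the global tensor.
    return [list(full) for _ in local_chunks]


chunks = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
sizes = [len(chunk) for chunk in chunks]
gathered = simulate_all_gather_v(chunks)
print(sizes)        # [3, 1, 2]
print(gathered[0])  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Here sizes plays the role of the sizes argument: it records how many entries along the gathered dimension each rank contributes.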

modulus.distributed.utils.all_reduce_v_wrapper(tensor: Tensor, sizes: List[int], dim: int = 0, use_fp32: bool = True, group: Optional[ProcessGroup] = None) → Tensor[source]

Implements a distributed AllReduceV primitive. It is based on the idea of a single global tensor which can be distributed along a specified dimension into chunks of variable size. This primitive assumes different global tensors of the same shape on each rank. It then redistributes chunks of all these tensors such that each rank receives all corresponding parts of a global tensor. Each rank then sums up the chunks after receiving them. By design, this primitive thus implements the backward pass of the “all_gather_v” primitive. In this case, the result would be a single global gradient tensor distributed onto different ranks.

Parameters
Returns
Return type
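
A single-process simulation helps show why this reduction is the backward pass of an all-gather: every rank holds a gradient for the whole global tensor, and rank r needs the sum of everyone's gradient for chunk r. The sketch uses Python lists in place of tensors, and the function name is hypothetical.

```python
from itertools import accumulate


def simulate_all_reduce_v(global_tensors, sizes):
    """Simulate AllReduceV over per-rank tensors of identical global shape.

    Rank r receives chunk r of every rank's tensor and sums those chunks.
    """
    offsets = [0] + list(accumulate(sizes))
    results = []
    for r, size in enumerate(sizes):
        start, stop = offsets[r], offsets[r] + size
        pieces = [tensor[start:stop] for tensor in global_tensors]
        # Element-wise sum of the pieces contributed by all ranks.
        results.append([sum(column) for column in zip(*pieces)])
    return results


sizes = [3, 1, 2]
# One full-length "gradient" per rank: rank r holds the value r + 1 everywhere.
grads = [[float(r + 1)] * sum(sizes) for r in range(len(sizes))]
reduced = simulate_all_reduce_v(grads, sizes)
print(reduced)  # [[6.0, 6.0, 6.0], [6.0], [6.0, 6.0]]
```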

modulus.distributed.utils.distributed_transpose(tensor, dim0, dim1, group=None, async_op=False)[source]

modulus.distributed.utils.gather_loss(loss: float, dst_rank: int = 0, mean: bool = True)[source]

Gathers loss from all processes to one for logging

Parameters
Raises

modulus.distributed.utils.gather_v_wrapper(tensor: Tensor, sizes: List[int], dim: int = 0, dst: int = 0, group: Optional[ProcessGroup] = None) → Tensor[source]

Implements a distributed GatherV primitive. It is based on the idea of a single global tensor which is distributed along a specified dimension into chunks of variable size. This primitive assumes such a distributed tensor and gathers all local tensors from each rank into the full global tensor valid on the specified destination rank.

Parameters
Returns
Return type

modulus.distributed.utils.get_memory_format(tensor)[source]

modulus.distributed.utils.indexed_all_to_all_v_wrapper(tensor: Tensor, indices: List[Tensor], sizes: List[List[int]], dim: int = 0, group: Optional[ProcessGroup] = None) → Tensor[source]

Implements an indexed version of a distributed AllToAllV primitive. It is based on the idea of a single global tensor which is distributed along a specified dimension into chunks of variable size. This primitive assumes a set of indices into this dimension which indicate the corresponding slices sent to each other rank forming an indexed version of an AllToAllV primitive.

Parameters
Returns
Return type
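
The indexed exchange is easier to follow on a tiny example. The simulation below uses Python lists for the per-rank tensors and assumes that indices[src][dst] lists the positions rank src sends to rank dst, with received slices concatenated in order of source rank; both the function name and that ordering are assumptions made for illustration.

```python
def simulate_indexed_all_to_all_v(local_tensors, indices):
    """Simulate an indexed AllToAllV on a single process.

    `indices[src][dst]` lists the positions of `local_tensors[src]` that
    rank `src` sends to rank `dst`.
    """
    world_size = len(local_tensors)
    received = []
    for dst in range(world_size):
        out = []
        for src in range(world_size):
            out.extend(local_tensors[src][i] for i in indices[src][dst])
        received.append(out)
    return received


tensors = [[10, 11, 12], [20, 21]]
indices = [
    [[0, 2], [1, 2]],  # rank 0 sends items 0, 2 to rank 0 and items 1, 2 to rank 1
    [[1], [0, 1]],     # rank 1 sends item 1 to rank 0 and items 0, 1 to rank 1
]
print(simulate_indexed_all_to_all_v(tensors, indices))
# [[10, 12, 21], [11, 12, 20, 21]]
```

Note that item 12 is sent to both ranks. An entry that is sent more than once receives gradient contributions from several ranks, which is why a backward pass for this operation has to reduce the gathered gradients.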

modulus.distributed.utils.indexed_all_to_all_v_wrapper_bwd(tensor: Tensor, indices: List[Tensor], sizes: List[List[int]], tensor_size_along_dim: int, use_fp32: bool = True, dim: int = 0, group: Optional[ProcessGroup] = None) → Tensor[source]

Implements the backward pass to the indexed version of a distributed AllToAllV primitive.

Parameters
Returns
Return type

modulus.distributed.utils.pad_helper(tensor, dim, new_size, mode='zero')[source]

modulus.distributed.utils.scatter_v_wrapper(tensor: Tensor, sizes: List[int], dim: int = 0, src: int = 0, group: Optional[ProcessGroup] = None) → Tensor[source]

Implements a distributed ScatterV primitive. It is based on the idea of a single global tensor which is distributed along a specified dimension into chunks of variable size. This primitive scatters the global tensor from a specified source rank into local chunks onto each other rank.

Parameters
Returns
Return type
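
ScatterV and GatherV are inverse data movements, which a single-process simulation makes easy to check. The helpers below are hypothetical stand-ins that operate on Python lists and perform no communication.

```python
from itertools import accumulate


def simulate_scatter_v(global_tensor, sizes):
    """Simulate ScatterV: split the source rank's tensor into per-rank chunks."""
    if sum(sizes) != len(global_tensor):
        raise ValueError("sizes must add up to the tensor length")
    offsets = [0] + list(accumulate(sizes))
    return [global_tensor[offsets[r]:offsets[r + 1]] for r in range(len(sizes))]


def simulate_gather_v(local_chunks):
    """Simulate GatherV: rebuild the global tensor on the destination rank."""
    return [value for chunk in local_chunks for value in chunk]


data = [0, 1, 2, 3, 4, 5, 6]
chunks = simulate_scatter_v(data, sizes=[2, 4, 1])
print(chunks)                     # [[0, 1], [2, 3, 4, 5], [6]]
print(simulate_gather_v(chunks))  # [0, 1, 2, 3, 4, 5, 6]
```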

modulus.distributed.utils.split_tensor_along_dim(tensor, dim, num_chunks)[source]

modulus.distributed.utils.truncate_helper(tensor, dim, new_size)[source]

modulus.distributed.autograd

class modulus.distributed.autograd.AllGatherVAutograd(*args, **kwargs)[source]

Bases: Function

Autograd Wrapper for a distributed AllGatherV primitive. It is based on the idea of a single global tensor which is distributed along a specified dimension into chunks of variable size. This primitive gathers all local tensors from each rank into the full global tensor onto each rank. It is intended to be used in tensor-parallel settings on tensors which require gradients to be passed through. The backward pass performs an AllReduceV operation where each rank gathers its corresponding chunk of a global tensor from each other rank and sums up these individual gradients.

static backward(ctx, grad_output: Tensor)[source]

static forward(ctx, tensor: Tensor, sizes: List[int], dim: int = 0, use_fp32: bool = True, group: Optional[ProcessGroup] = None) → Tensor[source]

class modulus.distributed.autograd.GatherVAutograd(*args, **kwargs)[source]

Bases: Function

Autograd Wrapper for a distributed GatherV primitive. It is based on the idea of a single global tensor which is distributed along a specified dimension into chunks of variable size. This primitive assumes such a distributed tensor and gathers all local tensors from each rank into the full global tensor valid on the specified destination rank. It is intended to be used in tensor-parallel settings on tensors which require gradients to be passed through. The backward pass corresponds to a straightforward ScatterV primitive distributing the global gradient from the specified destination rank to all the other ranks.

static backward(ctx, grad_output: Tensor) → Tensor[source]

static forward(ctx, tensor: Tensor, sizes: List[int], dim: int = 0, dst: int = 0, group: Optional[ProcessGroup] = None) → Tensor[source]

class modulus.distributed.autograd.IndexedAllToAllVAutograd(*args, **kwargs)[source]

Bases: Function

Autograd Wrapper for an Indexed AllToAllV primitive. It is based on the idea of a single global tensor which is distributed along a specified dimension into chunks of variable size. This primitive assumes a set of indices into this dimension which indicate the corresponding slices sent to each other rank forming an indexed version of an AllToAllV primitive. It is intended to be used in tensor-parallel settings on tensors which require gradients to be passed through. The backward pass more or less corresponds to the same operation as in the forward pass but with reversed roles and does an additional reduction of gathered gradients so that each rank finally will compute the overall gradient on its local tensor partition.

static backward(ctx, grad_output: Tensor) → Tensor[source]

static forward(ctx, tensor: Tensor, indices: List[Tensor], sizes: List[List[int]], use_fp32: bool = True, dim: int = 0, group: Optional[ProcessGroup] = None) → Tensor[source]

class modulus.distributed.autograd.ScatterVAutograd(*args, **kwargs)[source]

Bases: Function

Autograd Wrapper for Distributed ScatterV. It is based on the idea of a single global tensor which is distributed along a specified dimension into chunks of variable size. This primitive scatters the global tensor from a specified source rank into local chunks onto each other rank. It is intended to be used in tensor-parallel settings on tensors which require gradients to be passed through. The backward pass corresponds to a GatherV primitive gathering local gradients from all the other ranks into a single global gradient on the specified source rank.

static backward(ctx, grad_output: Tensor) → Tensor[source]

static forward(ctx, tensor: Tensor, sizes: List[int], dim: int = 0, src: int = 0, group=typing.Optional[torch.distributed.distributed_c10d.ProcessGroup]) → Tensor[source]

modulus.distributed.autograd.all_gather_v(tensor: Tensor, sizes: List[int], dim: int = 0, use_fp32: bool = True, group: Optional[ProcessGroup] = None) → Tensor[source]

Parameters
Returns
Return type

modulus.distributed.autograd.gather_v(tensor: Tensor, sizes: List[int], dim: int = 0, dst: int = 0, group: Optional[ProcessGroup] = None) → Tensor[source]

Parameters
Returns
Return type

modulus.distributed.autograd.indexed_all_to_all_v(tensor: Tensor, indices: List[Tensor], sizes: List[List[int]], use_fp32: bool = True, dim: int = 0, group: Optional[ProcessGroup] = None) → Tensor[source]

Parameters
Returns
Return type

modulus.distributed.autograd.scatter_v(tensor: Tensor, sizes: List[int], dim: int = 0, src: int = 0, group: Optional[ProcessGroup] = None) → Tensor[source]

Parameters
Returns
Return type

modulus.distributed.fft

class modulus.distributed.fft.DistributedIRFFT2(*args, **kwargs)[source]

Bases: Function

Autograd Wrapper for a distributed 2D complex to real IFFT primitive. It is based on the idea of a single global tensor which is distributed along a specified dimension into chunks of equal size. This primitive computes a 1D IFFT first along dim[1], then performs an AllToAll transpose before computing a 1D IFFT along dim[0]. The backward pass performs an FFT operation with communication in the opposite order as in the forward pass.

For the forward method, data should be split along dim[0] across the “spatial_parallel” process group. The output is data split in dim[1].

static backward(ctx, grad_output)[source]

static forward(ctx, x, s, dim, norm='ortho')[source]

class modulus.distributed.fft.DistributedRFFT2(*args, **kwargs)[source]

Bases: Function

Autograd Wrapper for a distributed 2D real to complex FFT primitive. It is based on the idea of a single global tensor which is distributed along a specified dimension into chunks of equal size. This primitive computes a 1D FFT first along dim[0], then performs an AllToAll transpose before computing a 1D FFT along dim[1]. The backward pass performs an IFFT operation with communication in the opposite order as in the forward pass.

For the forward method, data should be split along dim[1] across the “spatial_parallel” process group. The output is data split in dim[0].

static backward(ctx, grad_output)[source]

static forward(ctx, x, s, dim, norm='ortho')[source]
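
The decomposition these classes rely on, a 2D transform computed as 1D transforms along one dimension, a transpose, and 1D transforms along the other, can be checked numerically on one process. The sketch below uses a naive, unnormalized complex DFT from the standard library; it does not reproduce the real-to-complex truncation, the 'ortho' normalization, or the AllToAll communication, and all function names are hypothetical.

```python
import cmath


def dft(values):
    """Naive unnormalized 1D discrete Fourier transform."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def dft2_direct(matrix):
    """Reference 2D DFT evaluated straight from the double sum."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [
            sum(
                matrix[a][b] * cmath.exp(-2j * cmath.pi * (u * a / rows + v * b / cols))
                for a in range(rows)
                for b in range(cols)
            )
            for v in range(cols)
        ]
        for u in range(rows)
    ]


def dft2_by_axes(matrix):
    """2D DFT as 1D DFTs along dim 0, a transpose, then 1D DFTs along dim 1."""
    # Transform along dim 0 by working on columns (rows of the transpose).
    columns_done = [dft(col) for col in transpose(matrix)]
    # In a distributed run, this transpose is where the data layout would
    # switch from being split along one dimension to the other.
    along_dim0 = transpose(columns_done)
    return [dft(row) for row in along_dim0]


x = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0], [2.5, 1.5, -2.0], [4.0, 0.0, 1.0]]
direct = dft2_direct(x)
by_axes = dft2_by_axes(x)
max_err = max(abs(p - q) for r1, r2 in zip(direct, by_axes) for p, q in zip(r1, r2))
print(max_err < 1e-9)  # True
```

Because each 1D transform only needs one dimension to be fully local, a rank can hold a slab of the data, transform the local dimension, and exchange slabs before transforming the other.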

modulus.distributed.mappings

modulus.distributed.mappings.copy_to_parallel_region(input, group)[source]

modulus.distributed.mappings.gather_from_parallel_region(input, dim, group)[source]

modulus.distributed.mappings.gather_within_parallel_region(input, dim, group)[source]

modulus.distributed.mappings.reduce_from_parallel_region(input, group)[source]

modulus.distributed.mappings.scatter_to_parallel_region(input, dim, group)[source]
