nemo_automodel.components.distributed.utils
nemo_automodel.components.distributed.utils
Module Contents
Classes
Functions
Data
API
Bases: ContextDecorator
Context manager to enforce rank0 to process section over other ranks.
- Lets RANK==0 run the protected code first on each node.
- Inserts an extra barrier across only the node-local rank-0 processes.
- Works on a single GPU (no env flags, no distributed initialisation).
Note: it is assumed the scoped code is not torch.distributed heavy.
Create / bootstrap a (distributed) proc. group that rank0 enters first.
Returns:
True - if the current process is node-rank-0
False - otherwise
Tear down the context.
- If the current process was the first on its node, release the waiting peer ranks by issuing a barrier.
- If an exception occurred, abort the entire distributed job.
- If this context manager created the process group, destroy it.
Parameters:
Exception class if one
occurred inside the with block.
The raised exception instance.
Traceback associated with the exception.
Returns:
False so that any exception raised inside the with
block is propagated to the caller (standard CM semantics).
A timeout wrapper for torch.distributed.barrier() using Gloo backend.
This approach creates a separate Gloo process group for barrier operations while keeping the main NCCL backend for training operations.
Parameters:
Maximum time to wait for the barrier
Process group for the barrier operation
Returns:
True if barrier completed successfully, False if timeout occurred
Create a Gloo process group for barrier operations.
This allows us to use monitored_barrier with Gloo backend while keeping NCCL for the main training operations.
Returns:
Gloo process group for barriers
Perform a distributed barrier and then log a message on rank 0.
Parameters:
The message string to log.
Return (dp_rank, dp_size) to shard eval samples across DP ranks, else None.
Only DDP replicates the full model on every rank, so only there can each rank
score a disjoint subset independently. Under FSDP2 / MegatronFSDP the model is
sharded and generate() issues per-layer all-gather collectives that must
stay in lockstep across the group — different samples per rank would desync
them and hang — so every rank scores the same samples (None).
Get the synchronization context for the model.
Parameters:
The model to synchronize.
Whether the current step is an optimizer step.
Controls FSDP2 gradient synchronization during gradient accumulation.
- True: disable gradient sync on non-final micro-batches (saves comm, can increase peak memory).
- False: always sync gradients on every micro-batch (more comm, lower peak memory).
Returns:
A context manager that synchronizes the model.
Reduce loss across all ranks.
Parameters:
List of loss tensors to reduce.
Total number of tokens to divide the loss by.
Whether to divide the loss by the number of tokens.
Process group to reduce the loss across.
Returns: tuple[torch.Tensor, torch.Tensor]
Tuple of reduced loss and denominator.