nemo_automodel.components.training.signal_handler
Module Contents
Classes
Functions
API
Context manager to handle signals gracefully in a distributed setting.
Installs a signal handler upon entering the context that sets a flag
when the specified signal is received. The signals_received method
can be used to check if any rank received the signal (using all_gather).
The original signal handler is restored upon exiting the context.
Parameters:
The signal number to handle (e.g., signal.SIGTERM). Defaults to signal.SIGTERM.
Enters the signal-managed area.
Returns: DistributedSignalHandler
The handler instance itself (self).
Release the signal handler and restore the original handler.
Restore the original signal handler.
Returns: bool
True if the handler was released, False if it was already released.
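As a rough single-process illustration of the install, flag, and restore pattern described above, the sketch below uses only the standard library signal module. The class name LocalSignalFlag and its attributes are hypothetical and are not part of nemo_automodel. The cross-rank all_gather step is deliberately left out, so this only shows the local part of the behavior.

```python
import signal


class LocalSignalFlag:
    """Hypothetical single-process stand-in for the documented context manager."""

    def __init__(self, sig=signal.SIGTERM):
        self.sig = sig
        self.received = False      # set to True by the handler
        self._released = True      # nothing installed yet
        self._original = None      # handler to restore on release

    def __enter__(self):
        self.received = False
        self._released = False

        def _handler(signum, frame):
            # Only record the signal; the caller decides how to react.
            self.received = True

        # signal.signal returns the previously installed handler.
        self._original = signal.signal(self.sig, _handler)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.release()

    def release(self):
        """Restore the original handler; return False if already released."""
        if self._released:
            return False
        signal.signal(self.sig, self._original)
        self._released = True
        return True
```

Inside a `with LocalSignalFlag() as handler:` block, a loop could poll `handler.received` at a safe point (for example, between training steps) and save state before exiting. Note that Python only allows signal handlers to be installed from the main thread.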
Check if any rank in the default group received the signal.
Uses all_gather to collect the signal status from all ranks.
Returns: list[bool]
A list of booleans, where each element indicates if the corresponding rank received the signal.
Perform an all_gather operation on a single Python object.
Converts the item to a tensor, performs all_gather, and converts back to a list of Python objects from all ranks.
Parameters:
The Python object to gather.
The torch dtype to use for the intermediate tensor.
The process group to gather within (defaults to the global group).
Whether the operation should be asynchronous.
The local rank to determine the device.
Returns: list[Any]
A list containing the gathered items from all ranks in the group.
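The item-to-tensor-to-list round trip described above might look roughly like the following sketch, written directly against torch.distributed. The function names gather_scalar and demo_single_process are hypothetical, the sketch handles only scalar items, and it ignores the asynchronous option. The demo assumes a one-rank Gloo group initialized through a temporary file so that it can run on a CPU-only machine.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def gather_scalar(item, dtype, group=None, device="cpu"):
    """Gather one scalar from every rank and return plain Python values."""
    # Wrap the Python scalar in a one-element tensor on the target device.
    local = torch.tensor([item], dtype=dtype, device=device)
    world_size = dist.get_world_size(group)
    # all_gather needs one pre-allocated output tensor per rank.
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local, group=group)
    # Convert back to Python scalars, ordered by rank.
    return [t.item() for t in gathered]


def demo_single_process():
    """Run the gather in a one-rank Gloo group (assumption: no GPUs needed)."""
    init_file = os.path.join(tempfile.mkdtemp(), "pg_init")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=0,
        world_size=1,
    )
    try:
        # A boolean flag travels as an int32, so True comes back as 1.
        return gather_scalar(True, torch.int32)
    finally:
        dist.destroy_process_group()
```

Because the values come back in the intermediate dtype, a caller that wants booleans would convert each element, for example with `[bool(v) for v in result]`.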
Get the appropriate torch device based on the distributed backend.
Parameters:
The local rank, used to specify the CUDA device index for NCCL. If None, uses the default CUDA device.
Returns: torch.device
The torch.device (‘cuda’ for NCCL, ‘cpu’ for Gloo).
Raises:
RuntimeError: If the distributed backend is neither ‘nccl’ nor ‘gloo’.
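The backend-to-device mapping described above can be sketched as below. This is an illustration only: the function name device_for_backend is hypothetical, and it takes the backend name as an explicit argument, whereas the documented function presumably queries the active process group. Passing the name in keeps the sketch runnable without initializing torch.distributed.

```python
from typing import Optional

import torch


def device_for_backend(backend: str, local_rank: Optional[int] = None) -> torch.device:
    """Map a distributed backend name to the device its tensors should live on."""
    if backend == "nccl":
        # NCCL communicates GPU tensors; pick the rank's GPU when it is known.
        if local_rank is None:
            return torch.device("cuda")
        return torch.device(f"cuda:{local_rank}")
    if backend == "gloo":
        # Gloo is used here with CPU tensors.
        return torch.device("cpu")
    raise RuntimeError(f"Unsupported distributed backend: {backend!r}")
```

Constructing a torch.device does not require the hardware to be present, so the mapping itself can be checked on a CPU-only machine.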