nemo_automodel.components.training.signal_handler

Module Contents

Classes

Name	Description
`DistributedSignalHandler`	Context manager to handle signals gracefully in a distributed setting.

Functions

Name	Description
`all_gather_item`	Perform an all_gather operation on a single Python object.
`get_device`	Get the appropriate torch device based on the distributed backend.

API

class nemo_automodel.components.training.signal_handler.DistributedSignalHandler(
    sig: int = signal.SIGTERM
)

Context manager to handle signals gracefully in a distributed setting.

Installs a signal handler upon entering the context that sets a flag when the specified signal is received. The signals_received method can be used to check if any rank received the signal (using all_gather). The original signal handler is restored upon exiting the context.

Parameters:

sig

int, defaults to signal.SIGTERM

The signal number to handle (e.g., signal.SIGTERM).
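The install, flag, and restore behavior described above can be illustrated with a single-process stand-in built only on the standard library. This is a sketch under stated assumptions, not the library's implementation: `LocalSignalFlag` and `run_steps` are hypothetical names, and the cross-rank all_gather step is deliberately left out.

```python
import signal
from types import FrameType
from typing import Optional


class LocalSignalFlag:
    """Hypothetical single-process stand-in: flags `sig` while the block is active."""

    def __init__(self, sig: int = signal.SIGTERM) -> None:
        self.sig = sig
        self.received = False
        self._released = True
        self._original = None

    def __enter__(self) -> "LocalSignalFlag":
        self.received = False
        self._released = False
        # Remember the current handler so it can be restored on exit.
        self._original = signal.getsignal(self.sig)
        signal.signal(self.sig, self._on_signal)
        return self

    def _on_signal(self, signum: int, frame: Optional[FrameType]) -> None:
        # Only set a flag; the training loop decides when to react.
        self.received = True

    def release(self) -> bool:
        """Restore the original handler; return False if already released."""
        if self._released:
            return False
        if self._original is not None:
            signal.signal(self.sig, self._original)
        self._released = True
        return True

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        self.release()


def run_steps(num_steps: int, interrupt_at: int) -> int:
    """Run a fake loop and return the step at which the signal was noticed."""
    with LocalSignalFlag(signal.SIGTERM) as handler:
        for step in range(num_steps):
            if step == interrupt_at:
                signal.raise_signal(signal.SIGTERM)  # simulate a preemption notice
            if handler.received:
                return step  # a real loop would save a checkpoint here
    return num_steps
```

With the real class, the same loop shape would presumably poll `signals_received()` instead of a local flag, so that every rank sees a signal delivered to any one of them.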

nemo_automodel.components.training.signal_handler.DistributedSignalHandler.__enter__() -> nemo_automodel.components.training.signal_handler.DistributedSignalHandler

Enter the signal-managed context and install the signal handler.

Returns: DistributedSignalHandler

The handler instance itself (self).

nemo_automodel.components.training.signal_handler.DistributedSignalHandler.__exit__(
    exc_type: typing.Optional[type],
    exc_val: BaseException | None,
    exc_tb: types.TracebackType | None
) -> None

Release the signal handler and restore the original handler.

nemo_automodel.components.training.signal_handler.DistributedSignalHandler.release() -> bool

Restore the original signal handler.

Returns: bool

True if the handler was released, False if it was already released.

nemo_automodel.components.training.signal_handler.DistributedSignalHandler.signals_received() -> list[bool]

Check if any rank in the default group received the signal.

Uses all_gather to collect the signal status from all ranks.

Returns: list[bool]

A list of booleans, where each element indicates if the corresponding rank received the signal.
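A short sketch of how a caller might interpret the returned list. It assumes that index i of the list corresponds to rank i, which is how all_gather results are normally ordered; `should_stop` and `ranks_signaled` are hypothetical helper names, and the flag values are made up for illustration.

```python
def should_stop(flags: list[bool]) -> bool:
    """Return True if at least one rank reported the signal."""
    return any(flags)


def ranks_signaled(flags: list[bool]) -> list[int]:
    """Return the indices (assumed to be ranks) that reported the signal."""
    return [rank for rank, flag in enumerate(flags) if flag]


# Example value shaped like the documented list[bool] for a 4-rank job.
example_flags = [False, True, False, False]
stop = should_stop(example_flags)
signaled = ranks_signaled(example_flags)
```

Because every rank receives the same gathered list, all ranks reach the same stop decision at the same step, which avoids one rank exiting while the others wait in a collective.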

nemo_automodel.components.training.signal_handler.all_gather_item(
    item: typing.Any,
    dtype: torch.dtype,
    group: typing.Optional[torch.distributed.ProcessGroup] = None,
    async_op: bool = False,
    local_rank: typing.Optional[int] = None
) -> list[typing.Any]

Perform an all_gather operation on a single Python object.

Converts the item to a tensor, performs all_gather, and converts back to a list of Python objects from all ranks.

Parameters:

item

Any

The Python object to gather.

dtype

torch.dtype

The torch dtype to use for the intermediate tensor.

group

Optional[torch.distributed.ProcessGroup], defaults to None

The process group to gather within (defaults to the global group).

async_op

bool, defaults to False

Whether the operation should be asynchronous.

local_rank

Optional[int], defaults to None

The local rank to determine the device.

Returns: list[Any]

A list containing the gathered items from all ranks in the group.

nemo_automodel.components.training.signal_handler.get_device(
    local_rank: typing.Optional[int] = None
) -> torch.device

Get the appropriate torch device based on the distributed backend.

Parameters:

local_rank

Optional[int], defaults to None

The local rank, used to specify the CUDA device index for NCCL. If None, uses the default CUDA device.

Returns: torch.device

The torch.device ('cuda' for NCCL, 'cpu' for Gloo).

Raises:

RuntimeError: If the distributed backend is neither 'nccl' nor 'gloo'.
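The backend-to-device mapping documented above can be mirrored in a small torch-free sketch. This is an illustration only: `device_string_for_backend` is a hypothetical helper that takes the backend name as an argument and returns a device string, whereas the documented function takes only `local_rank` and returns a torch.device.

```python
from typing import Optional


def device_string_for_backend(backend: str, local_rank: Optional[int] = None) -> str:
    """Map a distributed backend name to a device string (hypothetical helper)."""
    if backend == "nccl":
        # NCCL communicates GPU tensors; use the indexed device when a rank is given.
        return "cuda" if local_rank is None else f"cuda:{local_rank}"
    if backend == "gloo":
        # Gloo is used here with CPU tensors.
        return "cpu"
    raise RuntimeError(f"Unsupported distributed backend: {backend!r}")
```

The resulting strings have the form accepted by the `torch.device(...)` constructor, so the mapping can be checked without a GPU or an initialized process group.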