nemo_automodel.components.training.signal_handler

Module Contents

Classes

| Name | Description |
| --- | --- |
| DistributedSignalHandler | Context manager to handle signals gracefully in a distributed setting. |

Functions

| Name | Description |
| --- | --- |
| all_gather_item | Perform an all_gather operation on a single Python object. |
| get_device | Get the appropriate torch device based on the distributed backend. |

API

class nemo_automodel.components.training.signal_handler.DistributedSignalHandler(
sig: int = signal.SIGTERM
)

Context manager to handle signals gracefully in a distributed setting.

Installs a signal handler upon entering the context that sets a flag when the specified signal is received. The signals_received method can be used to check if any rank received the signal (using all_gather). The original signal handler is restored upon exiting the context.

Parameters:

sig
int, defaults to signal.SIGTERM

The signal number to handle (e.g., signal.SIGTERM). Defaults to signal.SIGTERM.
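The flag-setting and handler-restoring behaviour described above can be illustrated with the standard library alone. The following is a minimal single-process sketch, not the library's implementation: `LocalSignalFlag` and its `received` attribute are hypothetical names, and the cross-rank all_gather step is deliberately left out.

```python
import signal
from types import FrameType
from typing import Optional


class LocalSignalFlag:
    """Hypothetical single-process stand-in: records whether `sig` arrived."""

    def __init__(self, sig: int = signal.SIGTERM) -> None:
        self.sig = sig
        self.received = False
        self._released = True
        self._original = signal.SIG_DFL

    def __enter__(self) -> "LocalSignalFlag":
        self.received = False
        self._released = False
        # Remember the handler that was active so it can be restored later.
        previous = signal.getsignal(self.sig)
        # getsignal() returns None for handlers not installed from Python;
        # fall back to the default action in that case.
        self._original = previous if previous is not None else signal.SIG_DFL

        def _handler(signum: int, frame: Optional[FrameType]) -> None:
            # Only set a flag; the caller decides when to react.
            self.received = True

        signal.signal(self.sig, _handler)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        self.release()

    def release(self) -> bool:
        """Restore the previous handler; return False if already restored."""
        if self._released:
            return False
        signal.signal(self.sig, self._original)
        self._released = True
        return True


if __name__ == "__main__":
    with LocalSignalFlag(signal.SIGTERM) as flag:
        signal.raise_signal(signal.SIGTERM)  # deliver SIGTERM to this process
        print("signal seen:", flag.received)
```

Because the handler only sets a flag, the process is not terminated when the signal arrives, and the caller can finish the current unit of work before reacting. Note that Python only allows `signal.signal` to be called from the main thread.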

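nemo_automodel.components.training.signal_handler.DistributedSignalHandler.__enter__() -> DistributedSignalHandler

Enter the signal-managed area.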

Returns: DistributedSignalHandler

Returns self.

nemo_automodel.components.training.signal_handler.DistributedSignalHandler.__exit__(
exc_type: typing.Optional[type],
exc_val: BaseException | None,
exc_tb: types.TracebackType | None
) -> None

Release the signal handler and restore the original handler.

nemo_automodel.components.training.signal_handler.DistributedSignalHandler.release() -> bool

Restore the original signal handler.

Returns: bool

True if the handler was released, False if it was already released.

nemo_automodel.components.training.signal_handler.DistributedSignalHandler.signals_received() -> list[bool]

Check if any rank in the default group received the signal.

Uses all_gather to collect the signal status from all ranks.

Returns: list[bool]

A list of booleans, where each element indicates whether the corresponding rank received the signal.
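One plausible way to consume the per-rank list is to reduce it with `any()` once per training step, so that a signal delivered to a single rank makes every rank checkpoint and stop together. The sketch below assumes only the documented `signals_received()` method; `train_until_signalled`, `ScriptedSignalSource`, and the `train_step` and `save_checkpoint` callbacks are hypothetical names introduced for illustration, and the scripted source stands in for a real handler so the example runs in one process.

```python
from typing import Callable, List, Protocol


class SignalSource(Protocol):
    """Anything exposing signals_received(), e.g. a DistributedSignalHandler."""

    def signals_received(self) -> List[bool]: ...


class ScriptedSignalSource:
    """Hypothetical stand-in that replays a fixed sequence of per-rank flags."""

    def __init__(self, script: List[List[bool]]) -> None:
        self._script = list(script)

    def signals_received(self) -> List[bool]:
        # Once the script is exhausted, report that no rank saw a signal.
        return self._script.pop(0) if self._script else [False]


def train_until_signalled(
    handler: SignalSource,
    max_steps: int,
    train_step: Callable[[int], None],
    save_checkpoint: Callable[[int], None],
) -> int:
    """Run up to max_steps; checkpoint and stop early if any rank was signalled.

    Returns the last step that was executed.
    """
    for step in range(1, max_steps + 1):
        train_step(step)
        # all_gather gives every rank the same list, so every rank takes the
        # same branch here and no rank is left waiting in a later collective.
        if any(handler.signals_received()):
            save_checkpoint(step)
            return step
    return max_steps


if __name__ == "__main__":
    saved_steps: List[int] = []
    source = ScriptedSignalSource([[False, False], [False, False], [False, True]])
    last = train_until_signalled(source, 10, lambda step: None, saved_steps.append)
    print("stopped after step", last, "checkpoints:", saved_steps)
```

Since `signals_received()` performs a collective operation, calling it on every step adds a small synchronization cost; checking every few steps is a possible trade-off, provided all ranks check on the same steps.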

nemo_automodel.components.training.signal_handler.all_gather_item(
item: typing.Any,
dtype: torch.dtype,
group: typing.Optional[torch.distributed.ProcessGroup] = None,
async_op: bool = False,
local_rank: typing.Optional[int] = None
) -> list[typing.Any]

Perform an all_gather operation on a single Python object.

Converts the item to a tensor, performs all_gather, and converts back to a list of Python objects from all ranks.

Parameters:

item
Any

The Python object to gather.

dtype
torch.dtype

The torch dtype to use for the intermediate tensor.

group
Optional[torch.distributed.ProcessGroup], defaults to None

The process group to gather within (defaults to the global group).

async_op
bool, defaults to False

Whether the operation should be asynchronous.

local_rank
Optional[int], defaults to None

The local rank to determine the device.

Returns: list[Any]

A list containing the gathered items from all ranks in the group.

nemo_automodel.components.training.signal_handler.get_device(
local_rank: typing.Optional[int] = None
) -> torch.device

Get the appropriate torch device based on the distributed backend.

Parameters:

local_rank
Optional[int], defaults to None

The local rank, used to specify the CUDA device index for NCCL. If None, uses the default CUDA device.

Returns: torch.device

The torch.device (‘cuda’ for NCCL, ‘cpu’ for Gloo).

Raises:

  • RuntimeError: If the distributed backend is neither ‘nccl’ nor ‘gloo’.
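The backend-to-device mapping documented for get_device can be summarised as a small pure function. This is a dependency-free sketch under stated assumptions, not the library's code: `device_string_for_backend` is a hypothetical name, the backend is passed in as a string rather than queried from torch.distributed, and the result is a device string (suitable for passing to `torch.device(...)`) rather than a `torch.device` object.

```python
from typing import Optional


def device_string_for_backend(backend: str, local_rank: Optional[int] = None) -> str:
    """Map a distributed backend name to a device string.

    'nccl' -> 'cuda' (or 'cuda:<local_rank>' when local_rank is given)
    'gloo' -> 'cpu'
    Any other backend raises RuntimeError, mirroring the documented behaviour.
    """
    if backend == "nccl":
        # Without a local rank, fall back to the default CUDA device.
        return "cuda" if local_rank is None else f"cuda:{local_rank}"
    if backend == "gloo":
        return "cpu"
    raise RuntimeError(f"Unsupported distributed backend: {backend!r}")


if __name__ == "__main__":
    print(device_string_for_backend("nccl", local_rank=1))
    print(device_string_for_backend("gloo"))
```

The device matters because the tensor used for a collective has to live where the backend can reach it: NCCL operates on CUDA tensors, while Gloo can work with CPU tensors.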