# nemo_automodel.components.training.signal_handler

## Module Contents

### Classes

| Name                                                                                                      | Description                                                            |
| --------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| [`DistributedSignalHandler`](#nemo_automodel-components-training-signal_handler-DistributedSignalHandler) | Context manager to handle signals gracefully in a distributed setting. |

### Functions

| Name                                                                                    | Description                                                        |
| --------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| [`all_gather_item`](#nemo_automodel-components-training-signal_handler-all_gather_item) | Perform an all\_gather operation on a single Python object.        |
| [`get_device`](#nemo_automodel-components-training-signal_handler-get_device)           | Get the appropriate torch device based on the distributed backend. |

### API

```python
class nemo_automodel.components.training.signal_handler.DistributedSignalHandler(
    sig: int = signal.SIGTERM
)
```

Context manager to handle signals gracefully in a distributed setting.

Installs a signal handler upon entering the context that sets a flag
when the specified signal is received. The `signals_received` method
can be used to check if any rank received the signal (using all\_gather).
The original signal handler is restored upon exiting the context.

**Parameters:**

* `sig` (`int`): The signal number to handle (e.g., `signal.SIGTERM`). Defaults to `signal.SIGTERM`.
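The install, flag, and restore behavior described above can be sketched with the standard library alone. `LocalSignalFlag` below is a hypothetical single-process stand-in written for illustration; it is not the library class, and it omits the cross-rank `all_gather` step entirely.

```python
import signal


class LocalSignalFlag:
    """Hypothetical single-process stand-in, not the library class."""

    def __init__(self, sig: int = signal.SIGTERM) -> None:
        self.sig = sig
        self.received = False
        self._released = True
        self._original = None

    def __enter__(self) -> "LocalSignalFlag":
        self.received = False
        self._released = False
        # Remember the current handler so it can be restored later.
        self._original = signal.getsignal(self.sig)

        def _on_signal(signum, frame):
            # Only record the event; the main loop decides how to react.
            self.received = True

        signal.signal(self.sig, _on_signal)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        self.release()

    def release(self) -> bool:
        # Return False when there is nothing left to restore.
        if self._released:
            return False
        signal.signal(self.sig, self._original)
        self._released = True
        return True


previous = signal.getsignal(signal.SIGTERM)
with LocalSignalFlag(signal.SIGTERM) as flag:
    signal.raise_signal(signal.SIGTERM)  # simulate a scheduler sending SIGTERM
    saw_signal = flag.received
restored = signal.getsignal(signal.SIGTERM) == previous
```

Because the handler only sets a flag, the process keeps running after the signal arrives and can decide at a safe point (for example, the end of a training step) whether to checkpoint and exit.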

```python
nemo_automodel.components.training.signal_handler.DistributedSignalHandler.__enter__() -> nemo_automodel.components.training.signal_handler.DistributedSignalHandler
```

Enter the signal-managed context and install the signal handler.

**Returns:** `DistributedSignalHandler`

The handler instance itself (`self`).

```python
nemo_automodel.components.training.signal_handler.DistributedSignalHandler.__exit__(
    exc_type: typing.Optional[type],
    exc_val: BaseException | None,
    exc_tb: types.TracebackType | None
) -> None
```

Release the signal handler and restore the original handler.

```python
nemo_automodel.components.training.signal_handler.DistributedSignalHandler.release() -> bool
```

Restore the original signal handler.

**Returns:** `bool`

True if the handler was released, False if it was already released.

```python
nemo_automodel.components.training.signal_handler.DistributedSignalHandler.signals_received() -> list[bool]
```

Check if any rank in the default group received the signal.

Uses all\_gather to collect the signal status from all ranks.

**Returns:** `list[bool]`

A list of booleans, one per rank, where each element indicates whether the corresponding rank received the signal.
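A typical way to consume this list is to reduce it with `any(...)` once per training step. Since `all_gather` gives every rank the same list, all ranks take the same branch and can stop at the same step. The sketch below is illustrative only: `train_until_signalled` and `FakeHandler` are hypothetical names, and the stub returns scripted values in place of a real `DistributedSignalHandler`.

```python
def train_until_signalled(handler, num_steps, on_stop):
    """Run up to num_steps steps, stopping early if any rank saw the signal."""
    completed = 0
    for step in range(num_steps):
        completed += 1  # placeholder for one real training step
        # Every rank sees the same gathered list, so all ranks agree here.
        if any(handler.signals_received()):
            on_stop(step)  # e.g. save a checkpoint before exiting
            break
    return completed


class FakeHandler:
    """Hypothetical stub that replays scripted per-rank flags."""

    def __init__(self, per_step_flags):
        self._flags = iter(per_step_flags)

    def signals_received(self):
        return next(self._flags)


saved_at = []
steps_run = train_until_signalled(
    FakeHandler([[False, False], [False, True], [False, False]]),
    num_steps=3,
    on_stop=saved_at.append,
)
```

In this scripted run, rank 1 reports the signal on the second step, so the loop calls `on_stop` once and skips the third step.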

```python
nemo_automodel.components.training.signal_handler.all_gather_item(
    item: typing.Any,
    dtype: torch.dtype,
    group: typing.Optional[torch.distributed.ProcessGroup] = None,
    async_op: bool = False,
    local_rank: typing.Optional[int] = None
) -> list[typing.Any]
```

Perform an all\_gather operation on a single Python object.

Converts the item to a tensor, performs all\_gather, and converts back to a list
of Python objects from all ranks.

**Parameters:**

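* `item` (`Any`): The Python object to gather.
* `dtype` (`torch.dtype`): The torch dtype to use for the intermediate tensor.
* `group` (`Optional[torch.distributed.ProcessGroup]`): The process group to gather within. Defaults to the global group.
* `async_op` (`bool`): Whether the operation should be asynchronous. Defaults to `False`.
* `local_rank` (`Optional[int]`): The local rank used to determine the device.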

**Returns:** `list[Any]`

A list containing the gathered items from all ranks in the group.
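The tensor round trip described above (item to tensor, `all_gather`, tensors back to Python values) can be sketched directly with `torch.distributed`. `gather_scalar` is a hypothetical helper, not the library function: it assumes a scalar item, always uses CPU tensors, and ignores `async_op` and `local_rank`. A single-process Gloo group with file-based rendezvous is set up only so the sketch runs without a launcher.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def gather_scalar(item, dtype, group=None):
    """Hypothetical helper mirroring the described steps on CPU tensors."""
    # 1. Wrap the Python scalar in a one-element tensor.
    tensor = torch.tensor([item], dtype=dtype)
    # 2. Allocate one receive buffer per rank and gather into them.
    world_size = dist.get_world_size(group=group)
    buffers = [torch.empty_like(tensor) for _ in range(world_size)]
    dist.all_gather(buffers, tensor, group=group)
    # 3. Convert each gathered tensor back to a Python value.
    return [buf.item() for buf in buffers]


# Single-process Gloo group (world_size=1) for demonstration purposes.
init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
dist.init_process_group(
    backend="gloo", init_method=f"file://{init_file}", rank=0, world_size=1
)
try:
    gathered = gather_scalar(True, dtype=torch.int32)
finally:
    dist.destroy_process_group()
```

Note that values come back in the intermediate dtype: a `bool` gathered through `torch.int32` returns as an `int`, so callers that need booleans would convert with `bool(...)`.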

```python
nemo_automodel.components.training.signal_handler.get_device(
    local_rank: typing.Optional[int] = None
) -> torch.device
```

Get the appropriate torch device based on the distributed backend.

**Parameters:**

* `local_rank` (`Optional[int]`): The local rank, used to specify the CUDA device index for NCCL. If `None`, uses the default CUDA device.

**Returns:** `torch.device`

The `torch.device` for the active backend (`'cuda'` for NCCL, `'cpu'` for Gloo).

**Raises:**

* `RuntimeError`: If the distributed backend is neither 'nccl' nor 'gloo'.
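The backend-to-device mapping documented above can be summarized as a small pure function. This is a sketch under assumptions, not the library implementation: `device_string_for_backend` is a hypothetical name, it takes the backend name as an argument instead of querying `torch.distributed`, and it returns a device string (of the kind accepted by `torch.device`) so that it runs without PyTorch installed.

```python
from typing import Optional


def device_string_for_backend(backend: str, local_rank: Optional[int] = None) -> str:
    """Hypothetical helper: map a backend name to a torch device string."""
    if backend == "nccl":
        # NCCL communicates GPU tensors; pick the rank's GPU when one is given.
        return "cuda" if local_rank is None else f"cuda:{local_rank}"
    if backend == "gloo":
        # Gloo is used here with CPU tensors.
        return "cpu"
    raise RuntimeError(f"Unsupported distributed backend: {backend!r}")


nccl_default = device_string_for_backend("nccl")
nccl_rank_2 = device_string_for_backend("nccl", local_rank=2)
gloo_device = device_string_for_backend("gloo")
```

Any other backend name raises `RuntimeError`, matching the documented behavior.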