# nemo_automodel.components.training.utils

## Module Contents

### Classes

| Name                                                                                         | Description                                                           |
| -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| [`ScopedModuleOffloading`](#nemo_automodel-components-training-utils-ScopedModuleOffloading) | Context manager that temporarily moves a module between CPU and CUDA. |

### Functions

| Name                                                                                                         | Description                                                            |
| ------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------- |
| [`_clip_grad_norm_impl`](#nemo_automodel-components-training-utils-_clip_grad_norm_impl)                     | -                                                                      |
| [`clip_grad_norm`](#nemo_automodel-components-training-utils-clip_grad_norm)                                 | Common gradient clipping helper.                                       |
| [`count_tail_padding`](#nemo_automodel-components-training-utils-count_tail_padding)                         | Counts the total number of padding tokens in the tail of labels.       |
| [`move_to_device`](#nemo_automodel-components-training-utils-move_to_device)                                 | Move a model and its buffers to a device and release stale CUDA cache. |
| [`prepare_after_first_microbatch`](#nemo_automodel-components-training-utils-prepare_after_first_microbatch) | Disable first-microbatch flag after the first forward-backward pass.   |
| [`prepare_for_final_backward`](#nemo_automodel-components-training-utils-prepare_for_final_backward)         | Prepare model parts before the final backward pass.                    |
| [`prepare_for_grad_accumulation`](#nemo_automodel-components-training-utils-prepare_for_grad_accumulation)   | Prepare model parts before starting gradient accumulation.             |
| [`scale_grads_and_clip_grad_norm`](#nemo_automodel-components-training-utils-scale_grads_and_clip_grad_norm) | Scale gradients for PP/EP in a single pass, then clip.                 |

### Data

[`_TE_EXPERT_PARAM_PATTERN`](#nemo_automodel-components-training-utils-_TE_EXPERT_PARAM_PATTERN)

### API

```python
class nemo_automodel.components.training.utils.ScopedModuleOffloading(
    model,
    enabled = False
)
```

Context manager that temporarily moves a module between CPU and CUDA.

```python
nemo_automodel.components.training.utils.ScopedModuleOffloading.__enter__()
```

```python
nemo_automodel.components.training.utils.ScopedModuleOffloading.__exit__(
    exc_type,
    exc_val,
    exc_tb
)
```
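The enter/exit pattern can be illustrated with a minimal sketch. The direction of the move is an assumption here (CUDA inside the scope, CPU outside), since the reference above only says the module is moved "between CPU and CUDA". `FakeModule` and `ScopedOffloadSketch` are hypothetical stand-ins, not the library's classes.

```python
class FakeModule:
    """Hypothetical stand-in for torch.nn.Module that only tracks its device."""

    def __init__(self):
        self.device = "cpu"

    def to(self, device):
        self.device = device
        return self


class ScopedOffloadSketch:
    """Illustrative only. Assumes 'cuda' inside the scope and 'cpu' outside."""

    def __init__(self, model, enabled=False):
        self.model = model
        self.enabled = enabled

    def __enter__(self):
        if self.enabled:
            self.model.to("cuda")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.enabled:
            self.model.to("cpu")
        return False  # do not swallow exceptions raised inside the scope


module = FakeModule()
with ScopedOffloadSketch(module, enabled=True):
    device_inside = module.device  # device while the scope is active
device_after = module.device       # device once the scope has exited
```

With `enabled=False` (the default in the signature above), the sketch leaves the module where it is, so the context manager can wrap code unconditionally.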

```python
nemo_automodel.components.training.utils._clip_grad_norm_impl(
    parameters: torch.Tensor | typing.Iterable[torch.Tensor],
    max_norm: float,
    norm_type: float = 2.0,
    error_if_nonfinite: bool = False,
    foreach: bool | None = None,
    pp_mesh: torch.distributed.device_mesh.DeviceMesh | None = None
) -> torch.Tensor
```

```python
nemo_automodel.components.training.utils.clip_grad_norm(
    max_grad_norm: float | None,
    model_parts: list[torch.nn.Module],
    norm_type: float = 2.0,
    pp_enabled: bool = False,
    device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
    pp_axis_name: str | None = None,
    foreach: bool = True,
    use_torch_clip_grad_norm: bool = False
)
```

Common gradient clipping helper.

Handles all parallelism strategies (TP, PP, EP/MoE) with automatic sharding-aware grouping.
Returns the gradient norm as a float, or 0.0 if clipping is skipped.

This function automatically:

* Groups parameters by sharding pattern (device mesh + placements)
* Computes norms correctly across different sharding strategies
* Handles MoE with separate DP/EP meshes
* Reduces norms across pipeline parallel stages when enabled

**Parameters:**

* `max_grad_norm`: Maximum gradient norm. If `None`, skips clipping.
* `model_parts`: List of model modules to clip.
* `norm_type`: Type of norm to use (default: 2.0 for L2).
* `pp_enabled`: Whether pipeline parallelism is enabled.
* `device_mesh`: Device mesh for parallelism.
* `moe_mesh`: MoE-specific device mesh (unused, kept for API compatibility).
* `ep_axis_name`: Expert parallel axis name (unused, kept for API compatibility).
* `pp_axis_name`: Pipeline parallel axis name.
* `foreach`: Whether to use the foreach implementation for clipping.
* `use_torch_clip_grad_norm`: Use PyTorch's optimized regular-tensor clipping path when possible.

**Returns:**

Total gradient norm as a float.
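As a rough illustration of what clipping by total norm means numerically, here is a pure-Python sketch. It follows the common convention of scaling by `max_norm / (total_norm + eps)` capped at 1, and it deliberately ignores sharding, device meshes, and pipeline stages. The helper name `clip_by_total_norm` and the `eps` value are assumptions for illustration, not part of this module.

```python
import math


def clip_by_total_norm(grads, max_norm, norm_type=2.0, eps=1e-6):
    """Scale gradient vectors in place and return the norm measured before clipping."""
    flat = [abs(g) for vec in grads for g in vec]
    if not flat:
        return 0.0
    if math.isinf(norm_type):
        total = max(flat)
    else:
        total = sum(g ** norm_type for g in flat) ** (1.0 / norm_type)
    # Gradients are only ever scaled down, never up.
    coef = min(1.0, max_norm / (total + eps))
    for vec in grads:
        for i in range(len(vec)):
            vec[i] *= coef
    return total


# Two "parameters" whose combined L2 norm is sqrt(3^2 + 4^2) = 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
total_norm = clip_by_total_norm(grads, max_norm=1.0)
```

The returned value is the norm before scaling, which is why a logged gradient norm can exceed `max_grad_norm` even when clipping is active.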

```python
nemo_automodel.components.training.utils.count_tail_padding(
    labels,
    ignore_label = -100
)
```

Counts the total number of padding tokens in the tail of each row of `labels`.

For example, given

`labels = torch.tensor([[-100, 1, 1, -100, -100], [-100, -100, 2, 3, 4], [5, 6, -100, -100, -100]])`

the rows end with 2, 0, and 3 trailing `-100` values respectively, so `count_tail_padding` returns 5. Note that the tensor contains more than 5 ignore labels overall; only the trailing ones are counted.

**Parameters:**

* `labels` (torch.Tensor): The labels.
* `ignore_label` (int, optional): Ignore label index. Defaults to -100.

**Returns:**

Total number of trailing ignored tokens in the `labels` input.
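The counting rule can be checked with a small pure-Python reference that works on nested lists instead of tensors. `count_tail_padding_reference` is a hypothetical helper written for this illustration; it is not the library's implementation.

```python
def count_tail_padding_reference(rows, ignore_label=-100):
    """Count ignore labels that appear as an unbroken run at the end of each row."""
    total = 0
    for row in rows:
        for value in reversed(row):
            if value != ignore_label:
                break  # first real label from the right ends the tail
            total += 1
    return total


labels = [
    [-100, 1, 1, -100, -100],   # 2 tail -100s
    [-100, -100, 2, 3, 4],      # 0 tail -100s
    [5, 6, -100, -100, -100],   # 3 tail -100s
]
tail_count = count_tail_padding_reference(labels)
all_ignored = sum(v == -100 for row in labels for v in row)
```

Here `tail_count` is 5 while `all_ignored` is 8, which shows that leading or interior ignore labels do not contribute.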

```python
nemo_automodel.components.training.utils.move_to_device(
    model,
    device
)
```

Move a model and its buffers to a device and release stale CUDA cache.

```python
nemo_automodel.components.training.utils.prepare_after_first_microbatch()
```

Disable first-microbatch flag after the first forward-backward pass.

Called after the first microbatch in gradient accumulation so that
subsequent microbatches reuse cached FP8 weights instead of re-quantizing.

```python
nemo_automodel.components.training.utils.prepare_for_final_backward(
    model_parts: list[torch.nn.Module],
    pp_enabled: bool = False
)
```

Prepare model parts before the final backward pass.

This is typically called before the final gradient accumulation step to prepare
FSDP states for gradient synchronization and resharding.

**Parameters:**

* `model_parts`: List of model parts (modules) to prepare.
* `pp_enabled`: Whether pipeline parallelism is enabled.

```python
nemo_automodel.components.training.utils.prepare_for_grad_accumulation(
    model_parts: list[torch.nn.Module],
    pp_enabled: bool = False
)
```

Prepare model parts before starting gradient accumulation.

This is typically called once at the start of gradient accumulation to prepare
FSDP states for the upcoming forward and backward passes.

**Parameters:**

* `model_parts`: List of model parts (modules) to prepare.
* `pp_enabled`: Whether pipeline parallelism is enabled.
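The three `prepare_*` helpers are described in terms of when they are called during gradient accumulation. One plausible ordering is sketched below with stub functions that only record their names. The loop structure, the stubs, and `train_step` are assumptions for illustration; an actual training recipe may arrange the calls differently.

```python
events = []


def prepare_for_grad_accumulation_stub():
    events.append("prepare_for_grad_accumulation")


def prepare_after_first_microbatch_stub():
    events.append("prepare_after_first_microbatch")


def prepare_for_final_backward_stub():
    events.append("prepare_for_final_backward")


def train_step(num_microbatches):
    """Record the assumed call order for one optimizer step."""
    prepare_for_grad_accumulation_stub()          # once, before any microbatch
    for i in range(num_microbatches):
        if i == num_microbatches - 1:
            prepare_for_final_backward_stub()     # just before the last backward
        events.append(f"forward_backward[{i}]")   # placeholder for the real work
        if i == 0:
            prepare_after_first_microbatch_stub() # after the first microbatch only


train_step(3)
```

After `train_step(3)`, `events` lists the start-of-accumulation hook, the first microbatch, the first-microbatch hook, the second microbatch, the final-backward hook, and then the last microbatch.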

```python
nemo_automodel.components.training.utils.scale_grads_and_clip_grad_norm(
    max_grad_norm: float | None,
    model_parts: list[torch.nn.Module],
    norm_type: float = 2.0,
    pp_enabled: bool = False,
    device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
    moe_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
    ep_axis_name: str | None = None,
    pp_axis_name: str | None = None,
    foreach: bool = True,
    num_label_tokens: int | None = None,
    dp_group_size: int | None = None,
    use_torch_clip_grad_norm: bool = False
)
```

Scale gradients for PP/EP in a single pass, then clip.

* PP scaling: divide all local grads by (num\_label\_tokens / dp\_group\_size).
* EP scaling: for parameters on the expert axis, divide grads by (dp\_group\_size / ep\_shard\_size).
* Finally, perform grad clipping with PP/EP-aware reductions.
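The two divisors above can be made concrete with a short arithmetic sketch. It assumes that expert parameters receive both scalings when PP and EP are active together, and it treats `ep_shard_size` as a plain input even though it is not a parameter of the function. `grad_divisors` is a hypothetical helper for illustration only.

```python
def grad_divisors(num_label_tokens, dp_group_size, ep_shard_size):
    """Return (pp_divisor, ep_divisor) as described in the bullets above."""
    pp_divisor = num_label_tokens / dp_group_size
    ep_divisor = dp_group_size / ep_shard_size
    return pp_divisor, ep_divisor


pp_div, ep_div = grad_divisors(num_label_tokens=4096, dp_group_size=8, ep_shard_size=2)

raw_grad = 1024.0
dense_grad = raw_grad / pp_div            # non-expert parameter: PP scaling only
expert_grad = raw_grad / pp_div / ep_div  # expert parameter: PP then EP scaling (assumed)
```

With these example values the divisors are 512.0 and 4.0, so the same raw gradient ends up as 2.0 on a dense parameter and 0.5 on an expert parameter.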

```python
nemo_automodel.components.training.utils._TE_EXPERT_PARAM_PATTERN = re.compile('(^|\\.)mlp\\.experts\\.(gate_up_linear|down_linear)\\.(weight|bias)\...
```