nemo_automodel.components.training.utils#

Module Contents#

Classes#

ScopedModuleOffloading

Functions#

count_tail_padding

Counts the total number of padding tokens in the tail of labels.

_clip_grad_norm_impl

clip_grad_norm

Common gradient clipping helper.

prepare_for_grad_accumulation

Prepare model parts before starting gradient accumulation.

prepare_for_final_backward

Prepare model parts before the final backward pass.

scale_grads_and_clip_grad_norm

Scale gradients for PP/EP in a single pass, then clip.

move_to_device

API#

nemo_automodel.components.training.utils.count_tail_padding(labels, ignore_label=-100)#

Counts the total number of padding tokens in the tail of labels.

For example:

    labels = torch.tensor([
        [-100, 1, 1, -100, -100],    # 2 tail -100s
        [-100, -100, 2, 3, 4],       # 0 tail -100s
        [5, 6, -100, -100, -100],    # 3 tail -100s
    ])

count_tail_padding returns 5 for this input. Note that the labels contain more than 5 ignore labels in total; only the trailing (tail) ones are counted.

Parameters:
  • labels (torch.Tensor) – the labels

  • ignore_label (int, optional) – ignore label index. Defaults to -100.

Returns:

Total number of tail padding (ignored) tokens in the labels input.

Return type:

int
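
For reference, here is a minimal, hedged usage sketch that mirrors the docstring example above; it assumes the function is importable from this module as documented and returns a plain int.

```python
import torch

from nemo_automodel.components.training.utils import count_tail_padding

labels = torch.tensor([
    [-100, 1, 1, -100, -100],    # 2 tail -100s
    [-100, -100, 2, 3, 4],       # 0 tail -100s
    [5, 6, -100, -100, -100],    # 3 tail -100s
])

# Only trailing ignore labels are counted, so this prints 5.
print(count_tail_padding(labels, ignore_label=-100))
```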

nemo_automodel.components.training.utils._clip_grad_norm_impl(
parameters: torch.Tensor | Iterable[torch.Tensor],
max_norm: float,
norm_type: float = 2.0,
error_if_nonfinite: bool = False,
foreach: bool | None = None,
pp_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
) -> torch.Tensor#
nemo_automodel.components.training.utils.clip_grad_norm(
max_grad_norm: float | None,
model_parts: list[torch.nn.Module],
*,
norm_type: float = 2.0,
pp_enabled: bool = False,
device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
pp_axis_name: str | None = None,
foreach: bool = True,
)#

Common gradient clipping helper.

Handles all parallelism strategies (TP, PP, EP/MoE) with automatic sharding-aware grouping. Returns the gradient norm as a float, or 0.0 if clipping is skipped.

This function automatically:

  • Groups parameters by sharding pattern (device mesh + placements)

  • Computes norms correctly across different sharding strategies

  • Handles MoE with separate DP/EP meshes

  • Reduces norms across pipeline parallel stages when enabled

Parameters:
  • max_grad_norm – Maximum gradient norm. If None, skips clipping.

  • model_parts – List of model modules to clip.

  • norm_type – Type of norm to use (default: 2.0 for L2).

  • pp_enabled – Whether pipeline parallelism is enabled.

  • device_mesh – Device mesh for parallelism.

  • moe_mesh – MoE-specific device mesh (unused, kept for API compatibility).

  • ep_axis_name – Expert parallel axis name (unused, kept for API compatibility).

  • pp_axis_name – Pipeline parallel axis name.

  • foreach – Whether to use foreach implementation for clipping.

Returns:

Total gradient norm as a float.
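
A minimal, hedged sketch of the call site: a toy single-module model stands in for model_parts, and the pipeline/mesh arguments are left at their defaults for a single-stage, non-pipelined model. In real training, model_parts would hold the (possibly pipeline-split) model modules.

```python
import torch
from torch import nn

from nemo_automodel.components.training.utils import clip_grad_norm

# Toy stand-in for the real model parts.
model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 16)).sum()
loss.backward()

# Clip after backward, before the optimizer step; returns the total grad norm
# as a float (0.0 if max_grad_norm is None and clipping is skipped).
grad_norm = clip_grad_norm(
    max_grad_norm=1.0,
    model_parts=[model],
    norm_type=2.0,
)
optimizer.step()
optimizer.zero_grad()
```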

nemo_automodel.components.training.utils.prepare_for_grad_accumulation(
model_parts: list[torch.nn.Module],
pp_enabled: bool = False,
)#

Prepare model parts before starting gradient accumulation.

This is typically called once at the start of gradient accumulation to prepare FSDP states for the upcoming forward and backward passes.

Parameters:
  • model_parts – List of model parts (modules) to prepare.

  • pp_enabled – Whether pipeline parallelism is enabled.

nemo_automodel.components.training.utils.prepare_for_final_backward(
model_parts: list[torch.nn.Module],
pp_enabled: bool = False,
)#

Prepare model parts before the final backward pass.

This is typically called before the final gradient accumulation step to prepare FSDP states for gradient synchronization and resharding.

Parameters:
  • model_parts – List of model parts (modules) to prepare.

  • pp_enabled – Whether pipeline parallelism is enabled.
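
The sketch below is a hedged illustration of how the two prepare_* helpers bracket a gradient-accumulation window; the toy model, micro-batch loop, and loss computation are placeholders, and pp_enabled is left False for a single-stage model.

```python
import torch
from torch import nn

from nemo_automodel.components.training.utils import (
    prepare_for_final_backward,
    prepare_for_grad_accumulation,
)

model = nn.Linear(16, 4)
model_parts = [model]
micro_batches = [torch.randn(8, 16) for _ in range(4)]

# Called once before the accumulation window starts.
prepare_for_grad_accumulation(model_parts, pp_enabled=False)

for i, mb in enumerate(micro_batches):
    if i == len(micro_batches) - 1:
        # Called before the final backward so FSDP can sync/reshard gradients.
        prepare_for_final_backward(model_parts, pp_enabled=False)
    model(mb).sum().backward()
```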

nemo_automodel.components.training.utils.scale_grads_and_clip_grad_norm(
max_grad_norm: float | None,
model_parts: list[torch.nn.Module],
*,
norm_type: float = 2.0,
pp_enabled: bool = False,
device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
moe_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
ep_axis_name: str | None = None,
pp_axis_name: str | None = None,
foreach: bool = True,
num_label_tokens: int | None = None,
dp_group_size: int | None = None,
)#

Scale gradients for PP/EP in a single pass, then clip.

  • PP scaling: divide all local grads by (num_label_tokens / dp_group_size).

  • EP scaling: for parameters on the expert axis, divide grads by (dp_group_size / ep_shard_size).

  • Finally, perform grad clipping with PP/EP-aware reductions.
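
A hedged call sketch follows; the numeric values and the model_parts list are placeholders (in practice num_label_tokens comes from the batch's non-ignored label count and dp_group_size from the data-parallel process group), and all mesh-related arguments are left at their defaults for a single-stage, non-pipelined model.

```python
from nemo_automodel.components.training.utils import scale_grads_and_clip_grad_norm

# model_parts as in the earlier sketches; values below are placeholders.
grad_norm = scale_grads_and_clip_grad_norm(
    max_grad_norm=1.0,
    model_parts=model_parts,
    norm_type=2.0,
    pp_enabled=False,        # single-stage example; no device mesh needed
    num_label_tokens=4096,   # placeholder: non-ignored label tokens in the global batch
    dp_group_size=8,         # placeholder: data-parallel world size
)
```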

nemo_automodel.components.training.utils.move_to_device(model, device)#
class nemo_automodel.components.training.utils.ScopedModuleOffloading(model, enabled=False)#

Initialization

__enter__()#
__exit__(exc_type, exc_val, exc_tb)#
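
Neither move_to_device nor ScopedModuleOffloading is documented beyond its signature here, so the sketch below only demonstrates the call shapes; the toy nn.Linear model and the assumption that enabled=True activates offloading for the duration of the scope are illustrative, not confirmed by this section.

```python
import torch
from torch import nn

from nemo_automodel.components.training.utils import (
    ScopedModuleOffloading,
    move_to_device,
)

model = nn.Linear(16, 4)

# Move the module's parameters and buffers to the target device.
move_to_device(model, torch.device("cpu"))

# Context-manager shape only; exact offloading semantics are assumed.
with ScopedModuleOffloading(model, enabled=True):
    out = model(torch.randn(2, 16))
```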