nemo_automodel.components.training.utils#

Module Contents#

Classes#

ScopedModuleOffloading

Functions#

count_tail_padding

Counts the total number of padding tokens in the tail of labels.

_clip_grad_norm_impl

clip_grad_norm

Common gradient clipping helper.

prepare_for_grad_accumulation

Prepare model parts before starting gradient accumulation.

prepare_for_final_backward

Prepare model parts before the final backward pass.

scale_grads_and_clip_grad_norm

Scale gradients for PP/EP in a single pass, then clip.

move_to_device

API#

nemo_automodel.components.training.utils.count_tail_padding(labels, ignore_label=-100)#

Counts the total number of padding tokens in the tail of labels.

For example:

    labels = torch.tensor([
        [-100, 1, 1, -100, -100],    # 2 tail -100s
        [-100, -100, 2, 3, 4],       # 0 tail -100s
        [5, 6, -100, -100, -100],    # 3 tail -100s
    ])

count_tail_padding returns 5 for this input. Note that the labels contain more than 5 ignore labels in total; only the trailing (tail) ones are counted.

Parameters:
  • labels (torch.Tensor) – the labels

  • ignore_label (int, optional) – ignore label index. Defaults to -100.

Returns:

Total number of tail padding (ignored) tokens in the labels input.

Return type:

int
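
For reference, here is a minimal, hedged usage sketch that mirrors the docstring example above; it assumes the function is importable from this module as documented and returns a plain int.

```python
import torch

from nemo_automodel.components.training.utils import count_tail_padding

labels = torch.tensor([
    [-100, 1, 1, -100, -100],    # 2 tail -100s
    [-100, -100, 2, 3, 4],       # 0 tail -100s
    [5, 6, -100, -100, -100],    # 3 tail -100s
])

# Only trailing ignore labels are counted, so this prints 5.
print(count_tail_padding(labels, ignore_label=-100))
```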

nemo_automodel.components.training.utils._clip_grad_norm_impl(
parameters: torch.Tensor | Iterable[torch.Tensor],
max_norm: float,
norm_type: float = 2.0,
error_if_nonfinite: bool = False,
foreach: bool | None = None,
pp_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
) -> torch.Tensor#
nemo_automodel.components.training.utils.clip_grad_norm(
max_grad_norm: float | None,
model_parts: list[torch.nn.Module],
*,
norm_type: float = 2.0,
pp_enabled: bool = False,
device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
pp_axis_name: str | None = None,
foreach: bool = True,
)#

Common gradient clipping helper.

Handles all parallelism strategies (TP, PP, EP/MoE) with automatic sharding-aware grouping. Returns the gradient norm as a float, or 0.0 if clipping is skipped.

This function automatically:

  • Groups parameters by sharding pattern (device mesh + placements)

  • Computes norms correctly across different sharding strategies

  • Handles MoE with separate DP/EP meshes

  • Reduces norms across pipeline parallel stages when enabled

Parameters:
  • max_grad_norm – Maximum gradient norm. If None, skips clipping.

  • model_parts – List of model modules to clip.

  • norm_type – Type of norm to use (default: 2.0 for L2).

  • pp_enabled – Whether pipeline parallelism is enabled.

  • device_mesh – Device mesh for parallelism.

  • moe_mesh – MoE-specific device mesh (unused, kept for API compatibility).

  • ep_axis_name – Expert parallel axis name (unused, kept for API compatibility).

  • pp_axis_name – Pipeline parallel axis name.

  • foreach – Whether to use foreach implementation for clipping.

Returns:

Total gradient norm as a float.
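
A minimal, hedged sketch of the call site: a toy single-module model stands in for model_parts, and the pipeline/mesh arguments are left at their defaults for a single-stage, non-pipelined model. In real training, model_parts would hold the (possibly pipeline-split) model modules.

```python
import torch
from torch import nn

from nemo_automodel.components.training.utils import clip_grad_norm

# Toy stand-in for the real model parts.
model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 16)).sum()
loss.backward()

# Clip after backward, before the optimizer step; returns the total grad norm
# as a float (0.0 if max_grad_norm is None and clipping is skipped).
grad_norm = clip_grad_norm(
    max_grad_norm=1.0,
    model_parts=[model],
    norm_type=2.0,
)
optimizer.step()
optimizer.zero_grad()
```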

nemo_automodel.components.training.utils.prepare_for_grad_accumulation(
model_parts: list[torch.nn.Module],
pp_enabled: bool = False,
)#

Prepare model parts before starting gradient accumulation.

This is typically called once at the start of gradient accumulation to prepare FSDP states for the upcoming forward and backward passes.

Parameters:
  • model_parts – List of model parts (modules) to prepare.

  • pp_enabled – Whether pipeline parallelism is enabled.

nemo_automodel.components.training.utils.prepare_for_final_backward(
model_parts: list[torch.nn.Module],
pp_enabled: bool = False,
)#

Prepare model parts before the final backward pass.

This is typically called before the final gradient accumulation step to prepare FSDP states for gradient synchronization and resharding.

Parameters:
  • model_parts – List of model parts (modules) to prepare.

  • pp_enabled – Whether pipeline parallelism is enabled.
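
The sketch below is a hedged illustration of how the two prepare_* helpers bracket a gradient-accumulation window; the toy model, micro-batch loop, and loss computation are placeholders, and pp_enabled is left False for a single-stage model.

```python
import torch
from torch import nn

from nemo_automodel.components.training.utils import (
    prepare_for_final_backward,
    prepare_for_grad_accumulation,
)

model = nn.Linear(16, 4)
model_parts = [model]
micro_batches = [torch.randn(8, 16) for _ in range(4)]

# Called once before the accumulation window starts.
prepare_for_grad_accumulation(model_parts, pp_enabled=False)

for i, mb in enumerate(micro_batches):
    if i == len(micro_batches) - 1:
        # Called before the final backward so FSDP can sync/reshard gradients.
        prepare_for_final_backward(model_parts, pp_enabled=False)
    model(mb).sum().backward()
```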

nemo_automodel.components.training.utils.scale_grads_and_clip_grad_norm(
max_grad_norm: float | None,
model_parts: list[torch.nn.Module],
*,
norm_type: float = 2.0,
pp_enabled: bool = False,
device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
moe_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
ep_axis_name: str | None = None,
pp_axis_name: str | None = None,
foreach: bool = True,
num_label_tokens: int | None = None,
dp_group_size: int | None = None,
)#

Scale gradients for PP/EP in a single pass, then clip.

  • PP scaling: divide all local grads by (num_label_tokens / dp_group_size).

  • EP scaling: for parameters on the expert axis, divide grads by (dp_group_size / ep_shard_size).

  • Finally, perform grad clipping with PP/EP-aware reductions.
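
A hedged call sketch follows; the numeric values and the model_parts list are placeholders (in practice num_label_tokens comes from the batch's non-ignored label count and dp_group_size from the data-parallel process group), and all mesh-related arguments are left at their defaults for a single-stage, non-pipelined model.

```python
from nemo_automodel.components.training.utils import scale_grads_and_clip_grad_norm

# model_parts as in the earlier sketches; values below are placeholders.
grad_norm = scale_grads_and_clip_grad_norm(
    max_grad_norm=1.0,
    model_parts=model_parts,
    norm_type=2.0,
    pp_enabled=False,        # single-stage example; no device mesh needed
    num_label_tokens=4096,   # placeholder: non-ignored label tokens in the global batch
    dp_group_size=8,         # placeholder: data-parallel world size
)
```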

nemo_automodel.components.training.utils.move_to_device(model, device)#
class nemo_automodel.components.training.utils.ScopedModuleOffloading(model, enabled=False)#

Initialization

__enter__()#
__exit__(exc_type, exc_val, exc_tb)#
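
Neither move_to_device nor ScopedModuleOffloading is documented beyond its signature here, so the sketch below only demonstrates the call shapes; the toy nn.Linear model and the assumption that enabled=True activates offloading for the duration of the scope are illustrative, not confirmed by this section.

```python
import torch
from torch import nn

from nemo_automodel.components.training.utils import (
    ScopedModuleOffloading,
    move_to_device,
)

model = nn.Linear(16, 4)

# Move the module's parameters and buffers to the target device.
move_to_device(model, torch.device("cpu"))

# Context-manager shape only; exact offloading semantics are assumed.
with ScopedModuleOffloading(model, enabled=True):
    out = model(torch.randn(2, 16))
```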