nemo_automodel.components.training.utils#
Module Contents#
Classes#
Functions#
| Function | Description |
|---|---|
| count_tail_padding | Counts the total number of padding tokens in the tail of labels. |
| clip_grad_norm | Common gradient clipping helper. |
| prepare_for_grad_accumulation | Prepare model parts before starting gradient accumulation. |
| prepare_for_final_backward | Prepare model parts before the final backward pass. |
| scale_grads_and_clip_grad_norm | Scale gradients for PP/EP in a single pass, then clip. |
API#
- nemo_automodel.components.training.utils.count_tail_padding(labels, ignore_label=-100)#
Counts the total number of padding tokens in the tail of labels.
For example:

```python
labels = torch.tensor([
    [-100, 1, 1, -100, -100],   # 2 tail -100s
    [-100, -100, 2, 3, 4],      # 0 tail -100s
    [5, 6, -100, -100, -100],   # 3 tail -100s
])
```

count_tail_padding(labels) returns 5. Note that there are more than 5 ignore labels in total; only the trailing run in each row is counted.
- Parameters:
labels (torch.Tensor) – the labels
ignore_label (int, optional) – ignore label index. Defaults to -100.
- Returns:
total number of ignored tokens in the tail of the labels input.
- Return type:
int
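The example above can be reproduced with a short, self-contained sketch. This is not the library implementation, just one way to count trailing ignore labels; the name count_tail_padding_sketch is illustrative:

```python
import torch

def count_tail_padding_sketch(labels: torch.Tensor, ignore_label: int = -100) -> int:
    # Flip each row so the trailing padding becomes a leading run, then
    # use cumprod to measure the length of that run per row.
    is_ignore = labels.flip(dims=[1]).eq(ignore_label).int()
    tail_lengths = torch.cumprod(is_ignore, dim=1).sum(dim=1)
    return int(tail_lengths.sum().item())

labels = torch.tensor([
    [-100, 1, 1, -100, -100],
    [-100, -100, 2, 3, 4],
    [5, 6, -100, -100, -100],
])
print(count_tail_padding_sketch(labels))  # 5
```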
- nemo_automodel.components.training.utils._clip_grad_norm_impl(
- parameters: torch.Tensor | Iterable[torch.Tensor],
- max_norm: float,
- norm_type: float = 2.0,
- error_if_nonfinite: bool = False,
- foreach: bool | None = None,
- pp_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
)#
- nemo_automodel.components.training.utils.clip_grad_norm(
- max_grad_norm: float | None,
- model_parts: list[torch.nn.Module],
- *,
- norm_type: float = 2.0,
- pp_enabled: bool = False,
- device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
- pp_axis_name: str | None = None,
- foreach: bool = True,
)#
Common gradient clipping helper.
Handles all parallelism strategies (TP, PP, EP/MoE) with automatic sharding-aware grouping. Returns the gradient norm as a float, or 0.0 if clipping is skipped.
This function automatically:
- Groups parameters by sharding pattern (device mesh + placements)
- Computes norms correctly across different sharding strategies
- Handles MoE with separate DP/EP meshes
- Reduces norms across pipeline parallel stages when enabled
- Parameters:
max_grad_norm – Maximum gradient norm. If None, skips clipping.
model_parts – List of model modules to clip.
norm_type – Type of norm to use (default: 2.0 for L2).
pp_enabled – Whether pipeline parallelism is enabled.
device_mesh – Device mesh for parallelism.
moe_mesh – MoE-specific device mesh (unused, kept for API compatibility).
ep_axis_name – Expert parallel axis name (unused, kept for API compatibility).
pp_axis_name – Pipeline parallel axis name.
foreach – Whether to use foreach implementation for clipping.
- Returns:
Total gradient norm as a float.
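A minimal usage sketch for the single-process case, assuming no pipeline parallelism and no device mesh (the toy model and optimizer below are illustrative, not part of this module):

```python
import torch
import torch.nn as nn
from nemo_automodel.components.training.utils import clip_grad_norm

model = nn.Linear(16, 16)  # stand-in for a real model part
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 16)).sum()
loss.backward()

# With pp_enabled=False and device_mesh=None this reduces to ordinary
# gradient clipping over the listed model parts.
grad_norm = clip_grad_norm(
    max_grad_norm=1.0,
    model_parts=[model],
    norm_type=2.0,
    pp_enabled=False,
)

optimizer.step()
optimizer.zero_grad()
print(f"total grad norm: {grad_norm:.4f}")
```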
- nemo_automodel.components.training.utils.prepare_for_grad_accumulation(
- model_parts: list[torch.nn.Module],
- pp_enabled: bool = False,
)#
Prepare model parts before starting gradient accumulation.
This is typically called once at the start of gradient accumulation to prepare FSDP states for the upcoming forward and backward passes.
- Parameters:
model_parts – List of model parts (modules) to prepare.
pp_enabled – Whether pipeline parallelism is enabled.
- nemo_automodel.components.training.utils.prepare_for_final_backward(
- model_parts: list[torch.nn.Module],
- pp_enabled: bool = False,
)#
Prepare model parts before the final backward pass.
This is typically called before the final gradient accumulation step to prepare FSDP states for gradient synchronization and resharding.
- Parameters:
model_parts – List of model parts (modules) to prepare.
pp_enabled – Whether pipeline parallelism is enabled.
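A sketch of where these two hooks sit in a gradient-accumulation loop; the model, data, and grad_accum_steps values are illustrative assumptions, and pp_enabled is left False for the single-pipeline case:

```python
import torch
import torch.nn as nn
from nemo_automodel.components.training.utils import (
    prepare_for_final_backward,
    prepare_for_grad_accumulation,
)

model_parts = [nn.Linear(16, 16)]  # stand-in for real model parts
optimizer = torch.optim.AdamW(model_parts[0].parameters(), lr=1e-4)
grad_accum_steps = 4

# Called once before the accumulation window starts.
prepare_for_grad_accumulation(model_parts, pp_enabled=False)

for step in range(grad_accum_steps):
    if step == grad_accum_steps - 1:
        # Called just before the final backward so FSDP states can sync/reshard.
        prepare_for_final_backward(model_parts, pp_enabled=False)
    loss = model_parts[0](torch.randn(8, 16)).sum() / grad_accum_steps
    loss.backward()

optimizer.step()
optimizer.zero_grad()
```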
- nemo_automodel.components.training.utils.scale_grads_and_clip_grad_norm(
- max_grad_norm: float | None,
- model_parts: list[torch.nn.Module],
- *,
- norm_type: float = 2.0,
- pp_enabled: bool = False,
- device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
- moe_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
- ep_axis_name: str | None = None,
- pp_axis_name: str | None = None,
- foreach: bool = True,
- num_label_tokens: int | None = None,
- dp_group_size: int | None = None,
)#
Scale gradients for PP/EP in a single pass, then clip.
PP scaling: divide all local grads by (num_label_tokens / dp_group_size).
EP scaling: for parameters on the expert axis, divide grads by (dp_group_size / ep_shard_size).
Finally, perform grad clipping with PP/EP-aware reductions.
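As a rough numerical illustration of the two divisors described above. The concrete values of num_label_tokens, dp_group_size, and ep_shard_size are assumptions, and applying both divisors to expert-axis parameters is an interpretation of the description, not a statement about the implementation:

```python
import torch

num_label_tokens = 8192  # assumed: non-ignored label tokens in the global batch
dp_group_size = 8        # assumed: data-parallel group size
ep_shard_size = 4        # assumed: expert-parallel shard size

pp_scale = num_label_tokens / dp_group_size  # divisor applied to every local grad
ep_scale = dp_group_size / ep_shard_size     # extra divisor for expert-axis params

dense_grad = torch.ones(4)   # stand-in gradient of a non-expert parameter
expert_grad = torch.ones(4)  # stand-in gradient of an expert-axis parameter

dense_grad.div_(pp_scale)
expert_grad.div_(pp_scale).div_(ep_scale)
```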
- nemo_automodel.components.training.utils.move_to_device(model, device)#