nemo_automodel.components.training.utils#

Module Contents#

Functions#

count_tail_padding

Counts the total number of padding tokens in the tail of labels.

clip_grad_norm_with_ep

clip_grad_norm_with_pp

clip_grad_norm

Common gradient clipping helper.

scale_grads_and_clip_grad_norm

Scale gradients for PP/EP in a single pass, then clip.

API#

nemo_automodel.components.training.utils.count_tail_padding(labels, ignore_label=-100)#

Counts the total number of padding tokens in the tail of labels.

e.g.

    labels = torch.tensor([
        [-100, 1, 1, -100, -100],   # 2 tail -100s
        [-100, -100, 2, 3, 4],      # 0 tail -100s
        [5, 6, -100, -100, -100],   # 3 tail -100s
    ])

Here count_tail_padding returns 5. Note that labels contains more than 5 ignore labels in total; only the trailing run of each row is counted.

Parameters:
  • labels (torch.Tensor) – the labels

  • ignore_label (int, optional) – ignore label index. Defaults to -100.

Returns:

total number of tail padding (ignored) tokens in the labels input.

Return type:

int
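The counting logic can be sketched in pure Python; plain nested lists stand in for the torch.Tensor argument, and the library's actual implementation is vectorized rather than looping per row:

```python
def count_tail_padding(labels, ignore_label=-100):
    """Count ignore_label tokens only in the trailing run of each row."""
    total = 0
    for row in labels:
        # Walk each row from the end; stop at the first real token.
        for tok in reversed(row):
            if tok != ignore_label:
                break
            total += 1
    return total

labels = [
    [-100, 1, 1, -100, -100],   # 2 tail -100s
    [-100, -100, 2, 3, 4],      # 0 tail -100s
    [5, 6, -100, -100, -100],   # 3 tail -100s
]
print(count_tail_padding(labels))  # 5
```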

nemo_automodel.components.training.utils.clip_grad_norm_with_ep(
parameters: torch.Tensor | Iterable[torch.Tensor],
max_norm: float,
norm_type: float,
error_if_nonfinite: bool,
foreach: bool | None,
pp_mesh: torch.distributed.device_mesh.DeviceMesh | None,
ep_axis_name: str,
) → torch.Tensor#
nemo_automodel.components.training.utils.clip_grad_norm_with_pp(
parameters: torch.Tensor | Iterable[torch.Tensor],
max_norm: float,
norm_type: float = 2.0,
error_if_nonfinite: bool = False,
foreach: bool | None = None,
pp_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
ep_axis_name: str | None = None,
) → torch.Tensor#
nemo_automodel.components.training.utils.clip_grad_norm(
max_grad_norm: float | None,
model_parts: list[torch.nn.Module],
*,
norm_type: float = 2.0,
pp_enabled: bool = False,
device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
moe_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
ep_axis_name: str | None = None,
pp_axis_name: str | None = None,
foreach: bool = True,
)#

Common gradient clipping helper.

Handles both the pipeline-parallel and single-model clipping paths. Returns the gradient norm as a float when available, or 0.0 when clipping is skipped due to constraints.
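Ignoring the distributed reductions, the single-model path can be sketched on a flat list of scalar gradients. The list is an illustrative stand-in for real parameter tensors; the actual helper delegates to torch-level clipping with mesh-aware norm reductions:

```python
import math

def clip_grad_norm_sketch(max_grad_norm, grads, norm_type=2.0):
    """Sketch of the non-PP path: compute the global norm over all
    gradients, then rescale them in place if it exceeds max_grad_norm.
    Returns the (pre-clip) norm, or 0.0 when clipping is skipped."""
    if max_grad_norm is None:
        return 0.0
    total = sum(abs(g) ** norm_type for g in grads) ** (1.0 / norm_type)
    if total > max_grad_norm:
        scale = max_grad_norm / (total + 1e-6)  # eps guards divide-by-zero
        grads[:] = [g * scale for g in grads]
    return total
```

After clipping, the rescaled gradients have norm at most max_grad_norm, while the returned value reports the norm observed before rescaling (useful for logging).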

nemo_automodel.components.training.utils.scale_grads_and_clip_grad_norm(
max_grad_norm: float | None,
model_parts: list[torch.nn.Module],
*,
norm_type: float = 2.0,
pp_enabled: bool = False,
device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
moe_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
ep_axis_name: str | None = None,
pp_axis_name: str | None = None,
foreach: bool = True,
num_label_tokens: int | None = None,
dp_group_size: int | None = None,
)#

Scale gradients for PP/EP in a single pass, then clip.

  • PP scaling: divide all local grads by (num_label_tokens / dp_group_size).

  • EP scaling: for parameters on the expert axis, divide grads by (dp_group_size / ep_shard_size).

  • Finally, perform grad clipping with PP/EP-aware reductions.
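The two scaling rules above can be sketched per parameter on scalar gradients. Here on_expert_axis and ep_shard_size are illustrative stand-ins for the mesh lookups the real helper performs:

```python
def scale_grads_sketch(grads, num_label_tokens, dp_group_size,
                       on_expert_axis=False, ep_shard_size=1):
    """Sketch of the pre-clip scaling pass.

    - PP scaling: divide every local grad by (num_label_tokens / dp_group_size).
    - EP scaling: for expert-axis parameters, additionally divide by
      (dp_group_size / ep_shard_size).
    """
    factor = num_label_tokens / dp_group_size
    if on_expert_axis:
        factor *= dp_group_size / ep_shard_size
    return [g / factor for g in grads]

# 1024 label tokens over a data-parallel group of 4: divide by 256.
print(scale_grads_sketch([512.0], 1024, 4))  # [2.0]
```

In the real helper both factors are applied in a single pass over the parameters, and the PP/EP-aware norm reduction then runs on the already-scaled gradients.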