nemo_automodel.components.training.utils
Module Contents
Classes
Functions
Data
API
Context manager that temporarily moves a module between CPU and CUDA.
Common gradient clipping helper.
Handles all parallelism strategies (TP, PP, EP/MoE) with automatic sharding-aware grouping. Returns the gradient norm as a float, or 0.0 if clipping is skipped.
This function automatically:
- Groups parameters by sharding pattern (device mesh + placements)
- Computes norms correctly across different sharding strategies
- Handles MoE with separate DP/EP meshes
- Reduces norms across pipeline parallel stages when enabled
Parameters:
Maximum gradient norm. If None, skips clipping.
List of model modules to clip.
Type of norm to use (default: 2.0 for L2).
Whether pipeline parallelism is enabled.
Device mesh for parallelism.
MoE-specific device mesh (unused, kept for API compatibility).
Expert parallel axis name (unused, kept for API compatibility).
Pipeline parallel axis name.
Whether to use foreach implementation for clipping.
Use PyTorch’s optimized regular-tensor clipping path when possible.
Returns:
Total gradient norm as a float.
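The clipping arithmetic described above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper names (`group_norm`, `combine_group_norms`, `clip_groups`) and the `eps` constant are assumptions made for illustration, and plain lists stand in for sharded gradient tensors. It shows how per-group norms combine into one global norm and how a single shared coefficient rescales every group.

```python
import math

def group_norm(grads, norm_type=2.0):
    """p-norm over one group of gradient values (a stand-in for one sharding group)."""
    if math.isinf(norm_type):
        return max((abs(g) for g in grads), default=0.0)
    return sum(abs(g) ** norm_type for g in grads) ** (1.0 / norm_type)

def combine_group_norms(norms, norm_type=2.0):
    """Combine per-group norms into a single global norm."""
    if math.isinf(norm_type):
        return max(norms, default=0.0)
    return sum(n ** norm_type for n in norms) ** (1.0 / norm_type)

def clip_groups(groups, max_norm, norm_type=2.0, eps=1e-6):
    """Scale all groups in place by one coefficient; return the pre-clip norm.

    Mirrors the documented contract: returns 0.0 when max_norm is None.
    The eps value is an assumption of this sketch.
    """
    if max_norm is None:
        return 0.0
    norm = combine_group_norms([group_norm(g, norm_type) for g in groups], norm_type)
    coef = min(1.0, max_norm / (norm + eps))  # never scale gradients up
    for g in groups:
        for i in range(len(g)):
            g[i] *= coef
    return norm

# Two groups with local L2 norms 3 and 4 give a global norm of 5.
groups = [[3.0, 0.0], [0.0, 4.0]]
norm = clip_groups(groups, max_norm=1.0)
```

Combining norms as the p-th root of the sum of p-th powers is what lets each group compute its norm locally before a single reduction, rather than gathering every gradient in one place.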
Counts the total number of padding tokens in the tail of labels.
For example:
labels = torch.tensor([
    [-100, 1, 1, -100, -100],   # 2 tail -100s
    [-100, -100, 2, 3, 4],      # 0 tail -100s
    [5, 6, -100, -100, -100],   # 3 tail -100s
])
count_tail_padding will return 5. Note that the input contains more than 5 ignore labels in total; only the trailing ones in each row are counted.
Parameters:
labels (torch.Tensor): The labels.
ignore_label (int, optional): Ignore label index. Defaults to -100.
Returns:
Total number of trailing ignored tokens in the labels input, summed over all rows.
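The distinction between tail padding and all ignored labels can be checked with a minimal pure-Python sketch. The function name `count_tail_padding_rows` is hypothetical, and nested lists stand in for the `torch.Tensor` input; this only illustrates the counting rule and is not the library's tensor implementation.

```python
def count_tail_padding_rows(rows, ignore_label=-100):
    """Count ignore_label entries at the end of each row, summed over rows."""
    total = 0
    for row in rows:
        # Walk each row from the right and stop at the first real label.
        for value in reversed(row):
            if value != ignore_label:
                break
            total += 1
    return total

labels = [
    [-100, 1, 1, -100, -100],   # 2 tail -100s
    [-100, -100, 2, 3, 4],      # 0 tail -100s
    [5, 6, -100, -100, -100],   # 3 tail -100s
]
tail = count_tail_padding_rows(labels)
# Counting every ignore label, not only trailing ones, gives a larger number.
all_ignored = sum(v == -100 for row in labels for v in row)
```

Here `tail` is 5 while `all_ignored` is 8, because leading and interior `-100` entries are not part of the tail.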
Move a model and its buffers to a device and release stale CUDA cache.
Disable first-microbatch flag after the first forward-backward pass.
Called after the first microbatch in gradient accumulation so that subsequent microbatches reuse cached FP8 weights instead of re-quantizing.
Prepare model parts before the final backward pass.
This is typically called before the final gradient accumulation step to prepare FSDP states for gradient synchronization and resharding.
Parameters:
List of model parts (modules) to prepare.
Whether pipeline parallelism is enabled.
Prepare model parts before starting gradient accumulation.
This is typically called once at the start of gradient accumulation to prepare FSDP states for the upcoming forward and backward passes.
Parameters:
List of model parts (modules) to prepare.
Whether pipeline parallelism is enabled.
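The call ordering implied by the two preparation helpers can be sketched with stubs. Everything in this block is hypothetical: the stub names, the `forward_backward` placeholder, and the assumption that the final-backward preparation runs immediately before the last microbatch. The real helpers operate on FSDP state, which is not modelled here.

```python
calls = []

def prepare_before_accumulation(model_parts, pp_enabled):
    """Hypothetical stand-in for the start-of-accumulation preparation."""
    calls.append("prepare_start")

def prepare_before_final_backward(model_parts, pp_enabled):
    """Hypothetical stand-in for the final-backward preparation."""
    calls.append("prepare_final")

def forward_backward(batch):
    """Placeholder for one microbatch forward and backward pass."""
    calls.append(f"fwd_bwd:{batch}")

def accumulate(model_parts, microbatches, pp_enabled=False):
    # Prepare once, before any microbatch runs.
    prepare_before_accumulation(model_parts, pp_enabled)
    last = len(microbatches) - 1
    for i, batch in enumerate(microbatches):
        if i == last:
            # Only the last step should trigger gradient sync and resharding.
            prepare_before_final_backward(model_parts, pp_enabled)
        forward_backward(batch)

accumulate(model_parts=[], microbatches=["mb0", "mb1", "mb2"])
```

After the call, `calls` records one start-of-accumulation preparation, two ordinary microbatches, the final preparation, and then the last microbatch.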
Scale gradients for PP/EP in a single pass, then clip.
- PP scaling: divide all local grads by (num_label_tokens / dp_group_size).
- EP scaling: for parameters on the expert axis, divide grads by (dp_group_size / ep_shard_size).
- Finally, perform grad clipping with PP/EP-aware reductions.
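The two scaling rules above reduce to simple arithmetic, shown here on scalar gradients. This is a sketch under stated assumptions: the function name `scale_grads` and the `is_expert` flags are hypothetical, and it assumes that expert-axis parameters receive both divisions when PP and EP scaling are both active. Clipping itself is omitted.

```python
def scale_grads(grads, is_expert, num_label_tokens, dp_group_size, ep_shard_size=None):
    """Apply PP scaling to every gradient and EP scaling to expert gradients."""
    pp_divisor = num_label_tokens / dp_group_size
    # Without an EP shard size, expert gradients get no extra scaling.
    ep_divisor = dp_group_size / ep_shard_size if ep_shard_size else 1.0
    scaled = []
    for g, expert in zip(grads, is_expert):
        g = g / pp_divisor          # PP scaling: all local grads
        if expert:
            g = g / ep_divisor      # EP scaling: expert-axis params only
        scaled.append(g)
    return scaled

# 16 label tokens and dp_group_size=4 give a PP divisor of 4;
# ep_shard_size=2 gives an EP divisor of 2.
scaled = scale_grads(
    grads=[8.0, 8.0],
    is_expert=[False, True],
    num_label_tokens=16,
    dp_group_size=4,
    ep_shard_size=2,
)
```

With these inputs the non-expert gradient becomes 2.0 and the expert gradient becomes 1.0.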