core.optimizer.clip_grads#

Gradient clipping.

Module Contents#

Functions#

get_grad_norm_fp32 – Calculate the norm of gradients in fp32.

clip_grad_by_total_norm_fp32 – Clips gradients of an iterable of parameters in fp32 by total norm.

count_zeros_fp32 – Counts the number of zeros in gradients associated with the passed-in list of parameters.

API#

core.optimizer.clip_grads.get_grad_norm_fp32(
    grads_for_norm: Union[List[torch.Tensor], torch.Tensor],
    norm_type: Union[int, float] = 2,
    grad_stats_parallel_group: Optional[torch.distributed.ProcessGroup] = None,
) → float#

Calculate the norm of gradients in fp32.

This is adapted from torch.nn.utils.clip_grad.clip_grad_norm_, with added functionality to handle model-parallel parameters.

Parameters:
  • grads_for_norm (Iterable[Tensor] or Tensor) – an iterable of Tensors or a single Tensor that will be used for calculating the grad norm.

  • norm_type (float or int) – the type of p-norm to use. Can be 'inf' for the infinity norm.

  • grad_stats_parallel_group (group) – Process group for reducing the grad norms. This is generally the model-parallel group for non-distributed optimizers, and the entire world for the distributed optimizer.

Returns:

Total norm of the gradients (viewed as a single vector).
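
The sketch below shows one way to call this function from a single process. It assumes Megatron-Core is installed (so the module imports as megatron.core.optimizer.clip_grads), a CUDA device is available, and torch.distributed is initialized; the rendezvous address and the toy parameter are illustrative only, not part of this API.

```python
import torch
import torch.distributed as dist

# Import path assumes a Megatron-Core installation; adjust to your checkout.
from megatron.core.optimizer.clip_grads import get_grad_norm_fp32

# Single-process group so the internal all-reduce has something to run on.
torch.cuda.set_device(0)
dist.init_process_group(
    backend="nccl", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)

# Toy fp32 parameter with a gradient attached on the GPU.
param = torch.nn.Parameter(torch.randn(4, 4, device="cuda"))
param.grad = torch.randn(4, 4, device="cuda")

# L2 norm of the gradients, reduced over the (here trivial) stats group.
total_norm = get_grad_norm_fp32(
    grads_for_norm=[param.grad],
    norm_type=2,
    grad_stats_parallel_group=dist.group.WORLD,
)
print(f"total grad norm: {total_norm:.4f}")
```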

core.optimizer.clip_grads.clip_grad_by_total_norm_fp32(
    parameters: Union[List[torch.Tensor], torch.Tensor],
    max_norm: Union[int, float],
    total_norm: float,
    use_decoupled_grad: bool = False,
)#

Clips gradients of an iterable of parameters in fp32 by total norm.

Note that the gradients are modified in place.

Parameters:
  • parameters (Iterable[Tensor] or Tensor) – an iterable of Tensors or a single Tensor that will have gradients normalized.

  • max_norm (float or int) – max norm of the gradients.

  • total_norm (float) – total norm of the gradients.

  • use_decoupled_grad (bool, optional) – whether to read the gradient from .decoupled_grad instead of .grad. Defaults to False.
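
A typical clipping flow computes the total norm first and then passes it to this function, which rescales the gradients in place when the total norm exceeds max_norm. The sketch below follows that pattern under the same assumptions as the previous sketch (Megatron-Core installed, CUDA device, torch.distributed initialized); the toy parameters and max_norm value are illustrative.

```python
import torch
import torch.distributed as dist

# Import path assumes a Megatron-Core installation; adjust to your checkout.
from megatron.core.optimizer.clip_grads import (
    clip_grad_by_total_norm_fp32,
    get_grad_norm_fp32,
)

torch.cuda.set_device(0)
dist.init_process_group(
    backend="nccl", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)

# Toy fp32 parameters with gradients attached on the GPU.
params = [torch.nn.Parameter(torch.randn(8, device="cuda")) for _ in range(3)]
for p in params:
    p.grad = torch.randn(8, device="cuda")

# Step 1: compute the total grad norm, reduced over the stats group.
max_norm = 1.0
total_norm = get_grad_norm_fp32(
    [p.grad for p in params],
    norm_type=2,
    grad_stats_parallel_group=dist.group.WORLD,
)

# Step 2: scale each .grad in place so the total norm does not exceed max_norm.
clip_grad_by_total_norm_fp32(params, max_norm=max_norm, total_norm=total_norm)
print(f"total norm before clipping: {total_norm:.4f}")
```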

core.optimizer.clip_grads.count_zeros_fp32(
    parameters: Union[List[torch.Tensor], torch.Tensor],
    grad_stats_parallel_group: torch.distributed.ProcessGroup,
    use_decoupled_grad: bool = False,
) → float#

Counts the number of zeros in gradients associated with the passed-in list of parameters.

Parameters:
  • parameters (Iterable[Tensor] or Tensor) – an iterable of Tensors or a single Tensor whose corresponding gradients will have their zero-valued elements counted.

  • grad_stats_parallel_group (group) – Process group for reducing the num_zeros count. This is generally the model-parallel group for non-distributed optimizers, and the entire world for the distributed optimizer.

  • use_decoupled_grad (bool, optional) – whether to read the gradient from .decoupled_grad instead of .grad. Defaults to False.
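
The call pattern below is a sketch under the same assumptions as the earlier sketches (Megatron-Core installed, CUDA device, torch.distributed initialized). It additionally initializes Megatron-Core's model-parallel state, which this helper's filtering of tensor-parallel duplicate parameters typically relies on; that dependency is an assumption about the implementation, so treat this as a call-pattern sketch rather than a guaranteed-standalone script.

```python
import torch
import torch.distributed as dist

# Import paths assume a Megatron-Core installation; adjust to your checkout.
from megatron.core import parallel_state
from megatron.core.optimizer.clip_grads import count_zeros_fp32

torch.cuda.set_device(0)
dist.init_process_group(
    backend="nccl", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)
# Assumption: the zero count may consult Megatron's model-parallel state,
# so set it up even for this single-GPU sketch.
parallel_state.initialize_model_parallel(tensor_model_parallel_size=1)

# Toy fp32 parameters; half of the second gradient is explicitly zeroed.
params = [torch.nn.Parameter(torch.randn(8, device="cuda")) for _ in range(2)]
params[0].grad = torch.randn(8, device="cuda")
params[1].grad = torch.randn(8, device="cuda")
params[1].grad[:4] = 0.0

# Per-rank zero counts are summed over the supplied process group.
num_zeros = count_zeros_fp32(
    params,
    grad_stats_parallel_group=dist.group.WORLD,
)
print(f"zero-valued gradient elements: {num_zeros}")  # at least 4 here
```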