core.optimizer.grad_scaler#

Megatron grad scaler.

Module Contents#

Classes#

MegatronGradScaler

Abstract base class for gradient scalers.

ConstantGradScaler

Grad scaler with a fixed scale factor.

DynamicGradScaler

Gradient scaler with a dynamic scale factor adjusted during training.

API#

class core.optimizer.grad_scaler.MegatronGradScaler(initial_scale: float)#

Bases: abc.ABC

Abstract base class for gradient scalers.

Parameters:

initial_scale (float) – The initial value for the loss scale.

Initialization

Initialize the loss scale to the given initial value.

property scale#
property inv_scale#
abstractmethod update(found_inf: bool)#
abstractmethod state_dict()#
abstractmethod load_state_dict(state_dict: Dict)#
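The abstract interface above can be mirrored with a minimal, self-contained sketch. Class, property, and method names follow the documentation; everything else (plain Python floats in place of whatever tensor type the real class uses, the docstring wording) is illustrative:

```python
from abc import ABC, abstractmethod
from typing import Dict


class MegatronGradScaler(ABC):
    """Illustrative sketch of the documented abstract interface."""

    def __init__(self, initial_scale: float):
        # Initialize the loss scale to the given initial value.
        assert initial_scale > 0.0
        self._scale = initial_scale

    @property
    def scale(self) -> float:
        """Current loss scale."""
        return self._scale

    @property
    def inv_scale(self) -> float:
        """Reciprocal of the loss scale, used to unscale gradients."""
        return 1.0 / self._scale

    @abstractmethod
    def update(self, found_inf: bool):
        """Adjust the scale after an iteration (subclass-specific)."""

    @abstractmethod
    def state_dict(self) -> Dict:
        """Return state for checkpointing."""

    @abstractmethod
    def load_state_dict(self, state_dict: Dict):
        """Restore state from a checkpoint."""
```

Because `update`, `state_dict`, and `load_state_dict` are abstract, the base class cannot be instantiated directly; only concrete subclasses such as the two below can be constructed.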
class core.optimizer.grad_scaler.ConstantGradScaler(initial_scale: float)#

Bases: core.optimizer.grad_scaler.MegatronGradScaler

Grad scaler with a fixed scale factor.

The loss scale is never adjusted, regardless of whether NaNs or Infs are detected in the gradients.

Initialization

Initialize the loss scale to the given initial value.

update(found_inf: bool)#
state_dict()#
load_state_dict(state_dict)#
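The fixed-scale behavior can be sketched as follows. This is a standalone illustration, not the Megatron implementation: the real class inherits from MegatronGradScaler and may store its scale as a tensor rather than a float.

```python
class ConstantGradScalerSketch:
    """Illustrative stand-in for ConstantGradScaler
    (name suffixed to avoid suggesting this is the real class)."""

    def __init__(self, initial_scale: float):
        assert initial_scale > 0.0
        self._scale = initial_scale

    @property
    def scale(self) -> float:
        return self._scale

    @property
    def inv_scale(self) -> float:
        return 1.0 / self._scale

    def update(self, found_inf: bool) -> None:
        # The scale is never adjusted, even when NaNs/Infs are found.
        pass

    def state_dict(self) -> dict:
        # Nothing dynamic to checkpoint for a constant scale.
        return {}

    def load_state_dict(self, state_dict: dict) -> None:
        pass
```

Since the scale never moves, `update` and the checkpointing hooks are no-ops; a constant scaler is useful when a safe scale for the model is already known.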
class core.optimizer.grad_scaler.DynamicGradScaler(
initial_scale: float,
min_scale: float,
growth_factor: float,
backoff_factor: float,
growth_interval: int,
hysteresis: int,
)#

Bases: core.optimizer.grad_scaler.MegatronGradScaler

Gradient scaler with a dynamic scale factor adjusted during training.

This class implements a loss scaling strategy to prevent numerical underflow during mixed-precision training. It reduces the loss scale by a backoff_factor if a hysteresis number of NaNs/Infs are detected in consecutive iterations. Conversely, it increases the loss scale by a growth_factor if no non-finite gradients are seen for a specified growth_interval of iterations.

Parameters:
  • initial_scale (float) – The starting value for the loss scale.

  • min_scale (float) – The lower bound for the loss scale.

  • growth_factor (float) – The multiplier used to increase the scale when gradients are stable. Must be greater than 1.0.

  • backoff_factor (float) – The multiplier used to decrease the scale when non-finite gradients are detected. Must be between 0.0 and 1.0.

  • growth_interval (int) – The number of consecutive stable iterations required before increasing the scale.

  • hysteresis (int) – The number of consecutive non-finite iterations required before decreasing the scale.

Initialization

Initialize the dynamic loss scale and the trackers used to adjust it during training.


update(found_inf: bool)#

Update the grad scaler's internal state based on whether NaNs/Infs were seen in the gradients this iteration.

state_dict()#
load_state_dict(state_dict: Dict)#
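The backoff/growth policy described above can be sketched in plain Python. This is an illustration of the documented semantics only; the real class's bookkeeping and tensor handling may differ in detail.

```python
class DynamicGradScalerSketch:
    """Illustrative sketch of the documented backoff/growth policy."""

    def __init__(self, initial_scale: float, min_scale: float,
                 growth_factor: float, backoff_factor: float,
                 growth_interval: int, hysteresis: int):
        assert growth_factor > 1.0
        assert 0.0 < backoff_factor < 1.0
        self.scale = initial_scale
        self.min_scale = min_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.hysteresis = hysteresis
        self._growth_tracker = 0               # consecutive stable iterations
        self._hysteresis_tracker = hysteresis  # remaining NaN/Inf tolerance

    def update(self, found_inf: bool) -> None:
        if found_inf:
            # Non-finite grads: reset the growth streak, consume hysteresis.
            self._growth_tracker = 0
            self._hysteresis_tracker -= 1
            if self._hysteresis_tracker <= 0:
                self.scale = max(self.scale * self.backoff_factor,
                                 self.min_scale)
        else:
            # Stable iteration: restore hysteresis, count toward growth.
            self._hysteresis_tracker = self.hysteresis
            self._growth_tracker += 1
            if self._growth_tracker == self.growth_interval:
                self._growth_tracker = 0
                self.scale *= self.growth_factor
```

With hysteresis=2, for example, a single NaN iteration leaves the scale untouched; only a second consecutive NaN multiplies it by backoff_factor, and the scale never drops below min_scale. After growth_interval consecutive clean iterations the scale is multiplied by growth_factor.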