core.optimizer.grad_scaler#

Megatron grad scaler.

Module Contents#

Classes#

MegatronGradScaler

Abstract base class for gradient scalers.

ConstantGradScaler

Grad scaler with a fixed scale factor.

DynamicGradScaler

Gradient scaler with a dynamic scale factor adjusted during training.

API#

class core.optimizer.grad_scaler.MegatronGradScaler(initial_scale: float)#

Bases: abc.ABC

Abstract base class for gradient scalers.

Parameters:

initial_scale (float) – The initial value for the loss scale.

Initialization

Initialize the loss scale to the given initial value.

property scale#
property inv_scale#
abstractmethod update(found_inf: bool)#
abstractmethod state_dict()#
abstractmethod load_state_dict(state_dict: Dict)#
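The abstract interface above can be mirrored with a minimal, self-contained sketch. Class, property, and method names follow the documentation; everything else (plain Python floats in place of whatever tensor type the real class uses, the docstring wording) is illustrative:

```python
from abc import ABC, abstractmethod
from typing import Dict


class MegatronGradScaler(ABC):
    """Illustrative sketch of the documented abstract interface."""

    def __init__(self, initial_scale: float):
        # Initialize the loss scale to the given initial value.
        assert initial_scale > 0.0
        self._scale = initial_scale

    @property
    def scale(self) -> float:
        """Current loss scale."""
        return self._scale

    @property
    def inv_scale(self) -> float:
        """Reciprocal of the loss scale, used to unscale gradients."""
        return 1.0 / self._scale

    @abstractmethod
    def update(self, found_inf: bool):
        """Adjust the scale after an iteration (subclass-specific)."""

    @abstractmethod
    def state_dict(self) -> Dict:
        """Return state for checkpointing."""

    @abstractmethod
    def load_state_dict(self, state_dict: Dict):
        """Restore state from a checkpoint."""
```

Because `update`, `state_dict`, and `load_state_dict` are abstract, the base class cannot be instantiated directly; only concrete subclasses such as the two below can be constructed.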
class core.optimizer.grad_scaler.ConstantGradScaler(initial_scale: float)#

Bases: core.optimizer.grad_scaler.MegatronGradScaler

Grad scaler with a fixed scale factor.

The loss scale is never adjusted, regardless of whether NaNs or Infs are detected in the gradients.

Initialization

Initialize the loss scale to the given initial value.

update(found_inf: bool)#
state_dict()#
load_state_dict(state_dict)#
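The fixed-scale behavior can be sketched as follows. This is a standalone illustration, not the Megatron implementation: the real class inherits from MegatronGradScaler and may store its scale as a tensor rather than a float.

```python
class ConstantGradScalerSketch:
    """Illustrative stand-in for ConstantGradScaler
    (name suffixed to avoid suggesting this is the real class)."""

    def __init__(self, initial_scale: float):
        assert initial_scale > 0.0
        self._scale = initial_scale

    @property
    def scale(self) -> float:
        return self._scale

    @property
    def inv_scale(self) -> float:
        return 1.0 / self._scale

    def update(self, found_inf: bool) -> None:
        # The scale is never adjusted, even when NaNs/Infs are found.
        pass

    def state_dict(self) -> dict:
        # Nothing dynamic to checkpoint for a constant scale.
        return {}

    def load_state_dict(self, state_dict: dict) -> None:
        pass
```

Since the scale never moves, `update` and the checkpointing hooks are no-ops; a constant scaler is useful when a safe scale for the model is already known.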
class core.optimizer.grad_scaler.DynamicGradScaler(
initial_scale: float,
min_scale: float,
growth_factor: float,
backoff_factor: float,
growth_interval: int,
hysteresis: int,
)#

Bases: core.optimizer.grad_scaler.MegatronGradScaler

Gradient scaler with a dynamic scale factor adjusted during training.

This class implements a loss scaling strategy to prevent numerical underflow during mixed-precision training. It reduces the loss scale by a backoff_factor if a hysteresis number of NaNs/Infs are detected in consecutive iterations. Conversely, it increases the loss scale by a growth_factor if no non-finite gradients are seen for a specified growth_interval of iterations.

Parameters:
  • initial_scale (float) – The starting value for the loss scale.

  • min_scale (float) – The lower bound for the loss scale.

  • growth_factor (float) – The multiplier used to increase the scale when gradients are stable. Must be greater than 1.0.

  • backoff_factor (float) – The multiplier used to decrease the scale when non-finite gradients are detected. Must be between 0.0 and 1.0.

  • growth_interval (int) – The number of consecutive stable iterations required before increasing the scale.

  • hysteresis (int) – The number of consecutive non-finite iterations required before decreasing the scale.

Initialization

Initialize the dynamic loss scale and the trackers used to adjust it during training.


update(found_inf: bool)#

Update the grad scaler's internal state based on whether NaNs/Infs were seen in the gradients this iteration.

state_dict()#
load_state_dict(state_dict: Dict)#
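The backoff/growth policy described above can be sketched in plain Python. This is an illustration of the documented semantics only; the real class's bookkeeping and tensor handling may differ in detail.

```python
class DynamicGradScalerSketch:
    """Illustrative sketch of the documented backoff/growth policy."""

    def __init__(self, initial_scale: float, min_scale: float,
                 growth_factor: float, backoff_factor: float,
                 growth_interval: int, hysteresis: int):
        assert growth_factor > 1.0
        assert 0.0 < backoff_factor < 1.0
        self.scale = initial_scale
        self.min_scale = min_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.hysteresis = hysteresis
        self._growth_tracker = 0               # consecutive stable iterations
        self._hysteresis_tracker = hysteresis  # remaining NaN/Inf tolerance

    def update(self, found_inf: bool) -> None:
        if found_inf:
            # Non-finite grads: reset the growth streak, consume hysteresis.
            self._growth_tracker = 0
            self._hysteresis_tracker -= 1
            if self._hysteresis_tracker <= 0:
                self.scale = max(self.scale * self.backoff_factor,
                                 self.min_scale)
        else:
            # Stable iteration: restore hysteresis, count toward growth.
            self._hysteresis_tracker = self.hysteresis
            self._growth_tracker += 1
            if self._growth_tracker == self.growth_interval:
                self._growth_tracker = 0
                self.scale *= self.growth_factor
```

With hysteresis=2, for example, a single NaN iteration leaves the scale untouched; only a second consecutive NaN multiplies it by backoff_factor, and the scale never drops below min_scale. After growth_interval consecutive clean iterations the scale is multiplied by growth_factor.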