core.optimizer.grad_scaler#

Megatron grad scaler.

Module Contents#

Classes#

MegatronGradScaler

Abstract base class for gradient scalers.

ConstantGradScaler

Constant grad scaler (loss scale is never adjusted regardless of NaNs seen in gradients).

DynamicGradScaler

Grad scaler with dynamic scale that gets adjusted during training.

API#

class core.optimizer.grad_scaler.MegatronGradScaler(initial_scale: float)#

Bases: abc.ABC

property scale#
property inv_scale#
abstractmethod update(found_inf: bool)#
abstractmethod state_dict()#
abstractmethod load_state_dict(state_dict: Dict)#
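
MegatronGradScaler is the abstract interface that the concrete scalers below implement. The following is a minimal, illustrative sketch of how a mixed-precision training step typically consumes that interface; the helper function is hypothetical (in Megatron Core the unscaling and inf/NaN check live in the optimizer wrappers), and it assumes scale and inv_scale are tensors that can multiply the loss and gradient tensors.

    import torch

    def backward_with_loss_scaling(loss, params, grad_scaler):
        """Hypothetical helper: scale the loss, backprop, unscale the grads,
        and report the inf/NaN observation back to the grad scaler."""
        # Scale the loss so small fp16 gradients do not underflow.
        (loss * grad_scaler.scale).backward()

        found_inf = False
        for p in params:
            if p.grad is None:
                continue
            # Undo the scaling using the cached reciprocal of the loss scale.
            p.grad.mul_(grad_scaler.inv_scale)
            if not torch.isfinite(p.grad).all():
                found_inf = True

        # Let the scaler keep or adjust its loss scale for the next iteration.
        grad_scaler.update(found_inf)
        return found_inf
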
class core.optimizer.grad_scaler.ConstantGradScaler(initial_scale: float)#

Bases: core.optimizer.grad_scaler.MegatronGradScaler

Constant grad scaler (loss scale is never adjusted regardless of NaNs seen in gradients).

Initialization

Initialize the scale value with the provided initial scale.

update(found_inf: bool)#
state_dict()#
load_state_dict(state_dict)#
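
As a quick illustration of the constant scaler (a hedged sketch, not taken from the library docs): the loss scale stays at initial_scale no matter what is passed to update(). The import path assumes the usual megatron.core package layout, and a CUDA device is assumed because the scale is typically stored as a GPU tensor.

    from megatron.core.optimizer.grad_scaler import ConstantGradScaler

    # Assumes a CUDA device is available (the loss scale is typically a GPU tensor).
    scaler = ConstantGradScaler(initial_scale=2.0 ** 12)

    # The loss scale is never adjusted, regardless of NaNs seen in the gradients.
    scaler.update(found_inf=True)
    scaler.update(found_inf=False)
    assert float(scaler.scale) == 2.0 ** 12
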
class core.optimizer.grad_scaler.DynamicGradScaler(
initial_scale: float,
min_scale: float,
growth_factor: float,
backoff_factor: float,
growth_interval: int,
hysteresis: int,
)#

Bases: core.optimizer.grad_scaler.MegatronGradScaler

Grad scaler with dynamic scale that gets adjusted during training.

Reduces the loss scale by backoff_factor if NaNs are seen in hysteresis consecutive iterations, and increases the loss scale by growth_factor if no NaNs are seen for growth_interval iterations.

Initialization

Initialize the dynamic grad scaler with the given scaling parameters.

Parameters:
  • initial_scale (float) – Initial loss scale value.

  • min_scale (float) – Minimum loss scale value.

  • growth_factor (float) – Factor by which to grow the loss scale if no NaNs are seen for growth_interval training iterations. Must be greater than 1.

  • backoff_factor (float) – Factor by which to decrease the loss scale if NaNs are seen in hysteresis consecutive training iterations. Must be between 0 and 1.

  • growth_interval (int) – Number of consecutive NaN-free training iterations before the loss scale is increased.

  • hysteresis (int) – Number of consecutive training iterations with NaNs before the loss scale is decreased.

update(found_inf: bool)#

Updates the grad scaler's internal state based on whether NaNs were seen in the gradients.

state_dict()#
load_state_dict(state_dict: Dict)#
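
To make the growth/backoff policy concrete, here is a hedged usage sketch based on the parameter descriptions above. The numbers are illustrative rather than library defaults, the import path assumes the usual megatron.core package layout, and a CUDA device is assumed because the scale is typically stored as a GPU tensor.

    from megatron.core.optimizer.grad_scaler import DynamicGradScaler

    scaler = DynamicGradScaler(
        initial_scale=2.0 ** 16,  # starting loss scale
        min_scale=1.0,            # floor the scale never drops below
        growth_factor=2.0,        # multiplier after a NaN-free growth_interval
        backoff_factor=0.5,       # multiplier after hysteresis consecutive NaN iterations
        growth_interval=1000,
        hysteresis=2,
    )

    # Two consecutive iterations with NaN/Inf gradients (hysteresis=2):
    # the loss scale backs off by backoff_factor, here 2**16 -> 2**15.
    scaler.update(found_inf=True)
    scaler.update(found_inf=True)

    # growth_interval consecutive NaN-free iterations: the loss scale grows
    # by growth_factor, here 2**15 -> 2**16.
    for _ in range(1000):
        scaler.update(found_inf=False)

    # The current scale and counters can be checkpointed and restored.
    state = scaler.state_dict()
    scaler.load_state_dict(state)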