core.optimizer.grad_scaler#
Megatron grad scaler.
Module Contents#
Classes#
MegatronGradScaler | Abstract base class for gradient scalers.
ConstantGradScaler | Grad scaler with a fixed scale factor.
DynamicGradScaler | Gradient scaler with a dynamic scale factor adjusted during training.
API#
- class core.optimizer.grad_scaler.MegatronGradScaler(initial_scale: float)#
Bases: abc.ABC
Abstract base class for gradient scalers.
- Parameters:
initial_scale (float) – The initial value for the loss scale.
Initialization
Initialize scale value with the input initial scale.
- property scale#
- property inv_scale#
- abstractmethod update(found_inf: bool)#
- abstractmethod state_dict()#
- abstractmethod load_state_dict(state_dict: Dict)#
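The abstract interface above is small: a subclass provides update(), state_dict(), and load_state_dict(), and inherits the scale and inv_scale properties. A minimal subclassing sketch, assuming the current value is kept in a private _scale tensor behind the scale property (the attribute name is an assumption, not part of the documented API):

```python
import torch

from core.optimizer.grad_scaler import MegatronGradScaler


class HalvingGradScaler(MegatronGradScaler):
    """Hypothetical scaler: halve the scale on non-finite grads, never grow."""

    def update(self, found_inf: bool):
        if found_inf:
            # _scale is assumed to be the tensor behind the `scale` property.
            self._scale = torch.clamp(self._scale * 0.5, min=1.0)

    def state_dict(self):
        return {'scale': self._scale}

    def load_state_dict(self, state_dict):
        self._scale = state_dict['scale']
```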
- class core.optimizer.grad_scaler.ConstantGradScaler(initial_scale: float)#
Bases: core.optimizer.grad_scaler.MegatronGradScaler
Grad scaler with a fixed scale factor.
The loss scale is never adjusted, regardless of whether NaNs or Infs are detected in the gradients.
Initialization
Initialize scale value with the input initial scale.
- update(found_inf: bool)#
- state_dict()#
- load_state_dict(state_dict)#
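A short usage sketch for the constant scaler: the loss is multiplied by scale before backward() and gradients by inv_scale afterwards, while update() leaves the scale untouched. The toy model and the float() conversions are illustrative assumptions (scale and inv_scale are assumed to be scalar tensors):

```python
import torch

from core.optimizer.grad_scaler import ConstantGradScaler

scaler = ConstantGradScaler(initial_scale=2.0**16)

model = torch.nn.Linear(4, 1)
loss = model(torch.randn(8, 4)).pow(2).mean()

# Scale the loss so small gradients survive reduced precision.
(loss * float(scaler.scale)).backward()

# Unscale the gradients before the optimizer step.
for p in model.parameters():
    p.grad.mul_(float(scaler.inv_scale))

# A no-op for the constant scaler: the scale never changes.
scaler.update(found_inf=False)
```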
- class core.optimizer.grad_scaler.DynamicGradScaler(
- initial_scale: float,
- min_scale: float,
- growth_factor: float,
- backoff_factor: float,
- growth_interval: int,
- hysteresis: int,
)#
Bases: core.optimizer.grad_scaler.MegatronGradScaler
Gradient scaler with a dynamic scale factor adjusted during training.
This class implements a loss scaling strategy to prevent numerical underflow during mixed-precision training. It reduces the loss scale by backoff_factor if NaNs/Infs are detected in hysteresis consecutive iterations. Conversely, it increases the loss scale by growth_factor if no non-finite gradients are seen for growth_interval consecutive iterations.
- Parameters:
initial_scale (float) – The starting value for the loss scale.
min_scale (float) – The lower bound for the loss scale.
growth_factor (float) – The multiplier used to increase the scale when gradients are stable. Must be greater than 1.0.
backoff_factor (float) – The multiplier used to decrease the scale when non-finite gradients are detected. Must be between 0.0 and 1.0.
growth_interval (int) – The number of consecutive stable iterations required before increasing the scale.
hysteresis (int) – The number of consecutive non-finite iterations required before decreasing the scale.
Initialization
Grad scaler with dynamic scale that gets adjusted during training.
- Parameters:
initial_scale (float) – Initial loss scale value.
min_scale (float) – Minimum loss scale value.
growth_factor (float) – Factor to grow loss scale by if NaNs are not seen in growth_interval training iterations. Must be greater than 1.
backoff_factor (float) – Factor to decrease loss scale by if NaNs are seen in hysteresis consecutive training iterations. Must be between 0 and 1.
growth_interval (int) – Number of training iterations of no NaNs before loss scale is increased.
hysteresis (int) – Number of training iterations of consecutive NaNs before loss scale is decreased.
- update(found_inf: bool)#
Updates the grad scaler's internal state based on whether NaNs/Infs were seen in the gradients this iteration.
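The backoff/growth rule described above can be restated as a standalone sketch; the tracker attributes below are illustrative, not the class's documented internals:

```python
class DynamicScaleSketch:
    """Illustrative re-implementation of the dynamic update rule."""

    def __init__(self, initial_scale, min_scale, growth_factor,
                 backoff_factor, growth_interval, hysteresis):
        self.scale = initial_scale
        self.min_scale = min_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.hysteresis = hysteresis
        self._growth_tracker = 0               # consecutive stable iterations
        self._hysteresis_tracker = hysteresis  # bad iterations still tolerated

    def update(self, found_inf: bool):
        if found_inf:
            # Non-finite gradients: restart the stable count and burn one
            # unit of hysteresis.
            self._growth_tracker = 0
            self._hysteresis_tracker -= 1
            if self._hysteresis_tracker <= 0:
                # Too many bad iterations in a row: back off, but never
                # below min_scale.
                self.scale = max(self.scale * self.backoff_factor,
                                 self.min_scale)
        else:
            self._growth_tracker += 1
            if self._growth_tracker == self.growth_interval:
                # A full stable interval: grow the scale and reset trackers.
                self._growth_tracker = 0
                self._hysteresis_tracker = self.hysteresis
                self.scale *= self.growth_factor


s = DynamicScaleSketch(65536.0, 1.0, 2.0, 0.5, growth_interval=3, hysteresis=2)
for bad in [False, False, False, True, True]:
    s.update(bad)
print(s.scale)  # 65536.0: grew to 131072 after 3 clean steps, then halved after 2 bad ones
```

The hysteresis counter means a lone NaN spike does not immediately shrink the scale; only a run of consecutive non-finite iterations triggers the backoff.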
- state_dict()#
- load_state_dict(state_dict: Dict)#
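Putting the pieces together, a training loop might drive the dynamic scaler as follows. The constructor arguments follow the signature on this page; the model, the gradient inf-check, and the float() conversions of the scale properties are placeholder assumptions:

```python
import torch

from core.optimizer.grad_scaler import DynamicGradScaler

scaler = DynamicGradScaler(
    initial_scale=2.0**16,
    min_scale=1.0,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=1000,
    hysteresis=2,
)

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(10):
    opt.zero_grad()
    loss = model(torch.randn(8, 4)).pow(2).mean()
    (loss * float(scaler.scale)).backward()

    # Skip the step if any gradient is NaN/Inf; otherwise unscale and step.
    found_inf = any(not torch.isfinite(p.grad).all()
                    for p in model.parameters())
    if not found_inf:
        for p in model.parameters():
            p.grad.mul_(float(scaler.inv_scale))
        opt.step()

    # Report the outcome so the scaler can back off or grow its scale.
    scaler.update(found_inf)
```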