core.optimizer.grad_scaler#
Megatron grad scaler.
Module Contents#
Classes#

| Class | Description |
|---|---|
| `ConstantGradScaler` | Constant grad scaler (loss scale is never adjusted regardless of NaNs seen in gradients). |
| `DynamicGradScaler` | Grad scaler with dynamic scale that gets adjusted during training. |
API#
- class core.optimizer.grad_scaler.MegatronGradScaler(initial_scale: float)#
Bases: abc.ABC
- property scale#
- property inv_scale#
- abstractmethod update(found_inf: bool)#
- abstractmethod state_dict()#
- abstractmethod load_state_dict(state_dict: Dict)#
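The base class pins down the contract every scaler follows: `scale` is the factor the loss is multiplied by before `backward()`, `inv_scale` is its reciprocal used to unscale gradients, and `update` is called once per iteration with a flag saying whether any gradient overflowed. Below is a minimal, self-contained sketch of that contract; plain Python floats stand in for the tensors the real implementation works with, and `GradScalerSketch` is an illustrative name, not the Megatron class:

```python
# Minimal sketch of the grad-scaler contract; illustrative only.
from abc import ABC, abstractmethod
from typing import Dict


class GradScalerSketch(ABC):
    def __init__(self, initial_scale: float):
        assert initial_scale > 0.0, "loss scale must be positive"
        self._scale = initial_scale

    @property
    def scale(self) -> float:
        # Factor to multiply the loss by before backward().
        return self._scale

    @property
    def inv_scale(self) -> float:
        # Reciprocal factor used to unscale gradients afterwards.
        return 1.0 / self._scale

    @abstractmethod
    def update(self, found_inf: bool):
        # Called once per iteration with the overflow flag.
        ...

    @abstractmethod
    def state_dict(self) -> Dict:
        ...

    @abstractmethod
    def load_state_dict(self, state_dict: Dict):
        ...
```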
- class core.optimizer.grad_scaler.ConstantGradScaler(initial_scale: float)#
Bases: core.optimizer.grad_scaler.MegatronGradScaler
Constant grad scaler (loss scale is never adjusted regardless of NaNs seen in gradients).
Initialization
Initialize the scale value with the input initial scale.
- update(found_inf: bool)#
- state_dict()#
- load_state_dict(state_dict)#
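`ConstantGradScaler` is the trivial implementation: `update` ignores `found_inf`, so the scale never moves. The following hypothetical mixed-precision step shows where `scale` and `inv_scale` enter a training loop; the toy model and loop are illustrative, not Megatron's training code:

```python
# Hypothetical loss-scaling step; toy model, not Megatron's loop.
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_scale = 4096.0  # plays the role of ConstantGradScaler.scale

x = torch.randn(4, 8)
loss = model(x).pow(2).mean()

# Scale the loss so small low-precision gradients do not underflow.
(loss * loss_scale).backward()

# Unscale gradients (inv_scale) before the optimizer step.
for p in model.parameters():
    if p.grad is not None:
        p.grad.mul_(1.0 / loss_scale)

optimizer.step()
# With a constant scaler, update(found_inf) is a no-op: the scale
# stays fixed even if an overflow was detected this step.
```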
- class core.optimizer.grad_scaler.DynamicGradScaler(
- initial_scale: float,
- min_scale: float,
- growth_factor: float,
- backoff_factor: float,
- growth_interval: int,
- hysteresis: int,
)#
Bases: core.optimizer.grad_scaler.MegatronGradScaler
Grad scaler with dynamic scale that gets adjusted during training.
Reduces the loss scale by `backoff_factor` if NaNs are seen in `hysteresis` consecutive iterations; increases it by `growth_factor` if no NaNs are seen for `growth_interval` iterations.
Initialization
- Parameters:
initial_scale (float) – Initial loss scale value.
min_scale (float) – Minimum loss scale value.
growth_factor (float) – Factor to grow loss scale by if NaNs are not seen in `growth_interval` training iterations. Must be greater than 1.
backoff_factor (float) – Factor to decrease loss scale by if NaNs are seen in `hysteresis` consecutive training iterations. Must be between 0 and 1.
growth_interval (int) – Number of training iterations of no NaNs before loss scale is increased.
hysteresis (int) – Number of training iterations of consecutive NaNs before loss scale is decreased.
- update(found_inf: bool)#
Updates internal state in the grad scaler based on whether NaNs were seen in the gradients this iteration; see the sketch after this class listing.
- state_dict()#
- load_state_dict(state_dict: Dict)#
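Putting the documented rule together: the scale backs off by `backoff_factor` (never below `min_scale`) once NaNs have exhausted the `hysteresis` budget, and grows by `growth_factor` after `growth_interval` clean iterations. A sketch of that update logic follows; the tracker attribute names (`_growth_tracker`, `_hysteresis_tracker`) are illustrative assumptions, not part of the documented API:

```python
# Sketch of the dynamic update rule described above; the tracker
# attributes are illustrative assumptions, while the behavior follows
# the documented parameters.
class DynamicScalerSketch:
    def __init__(self, initial_scale: float, min_scale: float,
                 growth_factor: float, backoff_factor: float,
                 growth_interval: int, hysteresis: int):
        assert growth_factor > 1.0
        assert 0.0 < backoff_factor < 1.0
        self._scale = initial_scale
        self.min_scale = min_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.hysteresis = hysteresis
        self._growth_tracker = 0               # clean iterations in a row
        self._hysteresis_tracker = hysteresis  # NaN budget before backoff

    def update(self, found_inf: bool):
        if found_inf:
            # Any overflow resets the growth streak and burns one
            # hysteresis credit; back off once the credits run out.
            self._growth_tracker = 0
            self._hysteresis_tracker -= 1
            if self._hysteresis_tracker <= 0:
                self._scale = max(self._scale * self.backoff_factor,
                                  self.min_scale)
        else:
            # Count clean iterations; grow after growth_interval of them.
            self._growth_tracker += 1
            if self._growth_tracker == self.growth_interval:
                self._growth_tracker = 0
                self._hysteresis_tracker = self.hysteresis
                self._scale *= self.growth_factor
```

A call such as `DynamicScalerSketch(initial_scale=2.0**16, min_scale=1.0, growth_factor=2.0, backoff_factor=0.5, growth_interval=1000, hysteresis=2)` shows the parameters in use; these specific values are illustrative, not defaults asserted by this module.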