core.optimizer.optimizer_config#
Module Contents#
Classes#
| ParamKey | Key to group parameters by. All such grouped parameters can share an optimizer config specification. |
| OptimizerConfig | Base optimizer configuration object. |
| AdamOptimizerConfig | Adam optimizer configuration object. |
| SGDOptimizerConfig | SGD optimizer configuration object. |
API#
- class core.optimizer.optimizer_config.ParamKey#
Key to group parameters by. All such grouped parameters can share an optimizer config specification.
- name: Union[str, Tuple[str]]#
‘field(…)’
Parameter name(s).
- attr: Union[str, Tuple[str]]#
‘field(…)’
Parameter attribute(s).
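Below is a minimal, hypothetical sketch of how parameters might be grouped into (ParamKey, parameters) pairs so that each group can later receive its own optimizer settings. It assumes ParamKey is a standard dataclass constructible by keyword (as the field(...) defaults above suggest); the "*" wildcard name and the weight/bias split are illustrative assumptions, not part of this module's API.

```python
# Hypothetical grouping helper; not part of core.optimizer.optimizer_config.
import torch.nn as nn

from core.optimizer.optimizer_config import ParamKey


def split_decay_groups(model: nn.Module):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        (no_decay if name.endswith("bias") else decay).append(param)
    # A plain list of pairs is used here; whether ParamKey is hashable
    # (and thus usable as a dict key) depends on its dataclass definition.
    return [
        (ParamKey(name="*", attr="weight"), decay),
        (ParamKey(name="*", attr="bias"), no_decay),
    ]
```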
- class core.optimizer.optimizer_config.OptimizerConfig#
Base optimizer configuration object.
- lr: Optional[float]#
None
Initial learning rate. Depending on the decay style and initial warmup, the learning rate at each iteration may differ from this value.
- min_lr: Optional[float]#
None
Minimum value for the learning rate. The scheduler clips values below this threshold.
- weight_decay: float#
0.01
Weight decay coefficient for L2 regularization.
- fp8_recipe: Optional[str]#
None
The fp8 recipe type, which affects the processing logic inside the distributed optimizer.
- fp16: bool#
False
If true, train with fp16 mixed precision. Defaults to False.
- bf16: bool#
False
If true, train with bf16 mixed precision. Defaults to False.
- reuse_grad_buf_for_mxfp8_param_ag: bool#
False
If true, reuse the grad buffer for param AG when using mxfp8 recipe. Should be set to True only when fp8_recipe is mxfp8 and fp8_param_gather is True.
- params_dtype: torch.dtype#
None
dtype used when initializing the weights. Defaults to torch.float32.
- use_precision_aware_optimizer: bool#
False
If true, allows optimizer-related tensors (master_param, gradients and optimizer states) to be set to lower precision. Defaults to False.
- store_param_remainders: bool#
True
If true, store only the 16 remainder bits of the FP32 master parameters in the optimizer state, omitting the 16 bits already shared with the BF16 model parameters. This lowers GPU memory usage. Defaults to True.
- main_grads_dtype: torch.dtype#
None
dtype of the main gradients when the precision-aware optimizer is enabled.
- main_params_dtype: torch.dtype#
None
dtype of the main parameters when the precision-aware optimizer is enabled.
- exp_avg_dtype: torch.dtype#
None
dtype of exp_avg (Adam first moment) when the precision-aware optimizer is enabled.
- exp_avg_sq_dtype: torch.dtype#
None
dtype of exp_avg_sq (Adam second moment) when the precision-aware optimizer is enabled.
- optimizer: str#
‘adam’
Optimizer name. NOTE: Deprecated, use individual optimizer classes instead.
- loss_scale: Optional[float]#
None
Static loss scale; positive powers of 2 can improve fp16 convergence. If None, dynamic loss scaling is used.
- initial_loss_scale: float#
None
Initial loss-scale for dynamic loss scaling.
- min_loss_scale: float#
1.0
Minimum loss scale for dynamic loss scaling.
- loss_scale_window: float#
1000
Window over which to raise/lower dynamic scale.
- hysteresis: int#
2
Hysteresis for dynamic loss scaling.
- adam_beta1: float#
0.9
First coefficient for computing running averages of gradient and its square in Adam optimizer.
- adam_beta2: float#
0.999
Second coefficient for computing running averages of gradient and its square in Adam optimizer.
- adam_eps: float#
1e-08
Term added to the denominator to improve numerical stability in Adam optimizer.
- decoupled_weight_decay: bool#
True
If true, decouples weight decay from the gradient update, equivalent to AdamW. If false, the original Adam update rule is used. Defaults to True.
- sgd_momentum: float#
0.9
Momentum factor for SGD optimizer.
- use_distributed_optimizer: bool#
False
Distribute optimizer state over data-parallel replicas.
- overlap_param_gather: bool#
False
If true, overlap param all-gather with forward compute. This argument is intended to have the same value as the overlap_param_gather argument in distributed_data_parallel_config.py. In the optimizer, this argument is only used when reuse_grad_buf_for_mxfp8_param_ag=True and fp8_param_gather=True.
- overlap_param_gather_with_optimizer_step: bool#
False
If true, overlap param all-gather of first bucket with optimizer step.
- optimizer_cpu_offload: bool#
False
If true, offload optimizer state tensors and computation to the CPU.
- optimizer_offload_fraction: float#
0.0
Specifies the fraction of optimizer states to offload from GPU memory to CPU.
- use_torch_optimizer_for_cpu_offload: bool#
False
If True, use torch.optim.Optimizer for CPU offload.
- overlap_cpu_optimizer_d2h_h2d: bool#
False
If true, overlap the CPU optimizer update with host-device data transfers. This can improve overall training efficiency by reducing idle time during data movement, allowing the optimizer to perform updates while gradients and parameters are being transferred between devices.
- pin_cpu_grads: bool#
True
If True, pin the optimizer gradients to CPU memory.
- pin_cpu_params: bool#
True
If True, pin the optimizer parameters to CPU memory.
- clip_grad: float#
1.0
Gradient clipping based on global L2 norm.
- log_num_zeros_in_grad: bool#
False
If true, calculate and log the number of zeros in the gradient.
- barrier_with_L1_time: bool#
False
If true, use barrier with level 1 time measurements.
- timers: Optional[Callable]#
None
Function to get timers.
- config_logger_dir: str#
When non-empty, dumps entry-point configs to config_logger_dir.
- __post_init__()#
Check the validity of the config.
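As a minimal sketch, assuming OptimizerConfig is a standard dataclass constructible by keyword (as its defaults and __post_init__ above suggest), a bf16 configuration with the distributed optimizer might look like the following; the values are illustrative, not recommendations.

```python
from core.optimizer.optimizer_config import OptimizerConfig

config = OptimizerConfig(
    lr=3e-4,
    min_lr=3e-5,
    weight_decay=0.01,
    bf16=True,                       # bf16 mixed precision; typically no loss scaling needed
    use_distributed_optimizer=True,  # shard optimizer state over data-parallel replicas
    clip_grad=1.0,                   # clip gradients by global L2 norm
)
# __post_init__() runs automatically on construction and checks the
# validity of the config.
```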
- class core.optimizer.optimizer_config.AdamOptimizerConfig#
Bases: core.optimizer.optimizer_config.OptimizerConfig
Adam optimizer configuration object.
- optimizer: str#
‘adam’
Optimizer name.
- adam_beta1: float#
0.9
First coefficient for computing running averages of gradient and its square in Adam optimizer.
- adam_beta2: float#
0.999
Second coefficient for computing running averages of gradient and its square in Adam optimizer.
- adam_eps: float#
1e-08
Term added to the denominator to improve numerical stability in Adam optimizer.
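A hedged sketch of how AdamOptimizerConfig's fields relate to a standard PyTorch Adam/AdamW constructor. How the library consumes the config internally is not shown in this module, so the mapping below is an illustration only; the dummy parameter list exists just to make the example self-contained.

```python
import torch

from core.optimizer.optimizer_config import AdamOptimizerConfig

cfg = AdamOptimizerConfig(
    lr=1e-4,
    weight_decay=0.1,
    adam_beta1=0.9,
    adam_beta2=0.95,  # illustrative; the documented default is 0.999
    adam_eps=1e-8,
)

params = [torch.nn.Parameter(torch.zeros(4, 4))]
# decoupled_weight_decay=True (the default) corresponds to AdamW-style decay.
optimizer_cls = torch.optim.AdamW if cfg.decoupled_weight_decay else torch.optim.Adam
opt = optimizer_cls(
    params,
    lr=cfg.lr,
    betas=(cfg.adam_beta1, cfg.adam_beta2),
    eps=cfg.adam_eps,
    weight_decay=cfg.weight_decay,
)
```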
- class core.optimizer.optimizer_config.SGDOptimizerConfig#
Bases: core.optimizer.optimizer_config.OptimizerConfig
SGD optimizer configuration object.
- optimizer: str#
‘sgd’
Optimizer name.
- sgd_momentum: float#
0.9
Momentum factor for SGD optimizer.
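Similarly, a hedged sketch of how SGDOptimizerConfig's fields might map onto torch.optim.SGD; again, this is not the library's internal wiring, just an illustration of the fields documented above.

```python
import torch

from core.optimizer.optimizer_config import SGDOptimizerConfig

cfg = SGDOptimizerConfig(lr=0.1, sgd_momentum=0.9, weight_decay=1e-4)

params = [torch.nn.Parameter(torch.zeros(4, 4))]
opt = torch.optim.SGD(
    params,
    lr=cfg.lr,
    momentum=cfg.sgd_momentum,
    weight_decay=cfg.weight_decay,
)
```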