core.optimizer.optimizer_config#

Module Contents#

Classes#

ParamPredicate

Wraps a matching function to make it hashable for ParamKey.

ParamWithNamePredicate

Wraps a matching function to make it hashable for ParamKey.

ParamKey

Key to group parameters by. All such grouped parameters can share an optimizer config specification.

OptimizerConfig

Configuration object for Megatron optimizers.

Data#

AdamOptimizerConfig

SGDOptimizerConfig

API#

class core.optimizer.optimizer_config.ParamPredicate#

Wraps a matching function to make it hashable for ParamKey.

Example:

>>> shape_1_param = ParamPredicate(name="s1", fn=lambda param: len(param.shape) == 1)
>>> shape_1_param(torch.empty(10))
True
>>> shape_1_param_copy = ParamPredicate(name="s1", fn=lambda param: len(param.shape) == 1)
>>> shape_1_param == shape_1_param_copy  # name is used to match
True
>>> {shape_1_param, shape_1_param_copy} == {shape_1_param}  # set hashing works properly
True

Note: __hash__ and __eq__ are automatically generated by @dataclass(frozen=True) based solely on 'name' because we set compare=False/hash=False on 'fn'.

name: str#

None

fn: Callable[[torch.nn.Parameter], bool]#

‘field(…)’

__call__(param: torch.nn.Parameter) bool#
class core.optimizer.optimizer_config.ParamWithNamePredicate#

Wraps a matching function to make it hashable for ParamKey.

Example:

>>> shape_1_not_qkln_param = ParamWithNamePredicate(
...     name="s1_not_qkln",
...     fn=lambda param, name: (
...         (len(param.shape) == 1 or name.endswith(".bias"))
...         and not ("q_layernorm." in name or "k_layernorm." in name)
...     ),
... )
>>> shape_1_not_qkln_param(torch.empty(10), "interesting.bias")
True
>>> shape_1_not_qkln_param(torch.empty(10), "interesting.q_layernorm.bias")
False

Note: __hash__ and __eq__ are automatically generated by @dataclass(frozen=True) based solely on 'name' because we set compare=False/hash=False on 'fn'.

name: str#

None

fn: Callable[[torch.nn.Parameter, str], bool]#

‘field(…)’

__call__(param: torch.nn.Parameter, name: str) bool#
class core.optimizer.optimizer_config.ParamKey#

Key to group parameters by. All such grouped parameters can share an optimizer config specification.

name: Union[str, Tuple[str]]#

‘field(…)’

Parameter name(s); Unix filesystem path syntax is used for matching.

attr: Union[str, Tuple[str]]#

‘field(…)’

Parameter attribute(s).

predicate: Union[core.optimizer.optimizer_config.ParamPredicate, Tuple[core.optimizer.optimizer_config.ParamPredicate]]#

‘field(…)’

Predicate(s) to match parameters by. If multiple predicates are provided, a parameter matches if any one of them matches.

with_name_predicate: Union[core.optimizer.optimizer_config.ParamWithNamePredicate, Tuple[core.optimizer.optimizer_config.ParamWithNamePredicate]]#

‘field(…)’

Predicate(s) that match a parameter together with its name. If multiple predicates are provided, a parameter matches if any one of them matches. This is useful if you need to filter out some parameters from an otherwise positive match by their name.

matches(param: torch.nn.Parameter, param_name: str) bool#

Returns True if the passed-in parameter (with its name) matches this ParamKey.

Parameters:
  • param (torch.nn.Parameter) – Handle to parameter object.

  • param_name (str) – Name of parameter in underlying PyTorch module.

Returns:

True if the parameter matches this ParamKey.

Return type:

bool
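As a rough illustration of how these fields combine, the sketch below constructs keys by name pattern and by predicate and queries them with matches(). Which fields may be omitted and how unset fields behave are assumptions here; check the dataclass defaults in the implementation.

import torch

# Hypothetical grouping keys -- field semantics per the descriptions above.
bias_key = ParamKey(name="*.bias")  # name patterns use Unix path-style matching
no_wd_key = ParamKey(
    predicate=ParamPredicate(name="1d", fn=lambda p: len(p.shape) == 1)
)
not_qkln_key = ParamKey(
    with_name_predicate=ParamWithNamePredicate(
        name="not_qkln",
        fn=lambda p, n: "q_layernorm." not in n and "k_layernorm." not in n,
    )
)

param = torch.nn.Parameter(torch.empty(16))
print(bias_key.matches(param, "decoder.layers.0.mlp.linear_fc1.bias"))
print(no_wd_key.matches(param, "decoder.final_layernorm.weight"))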

class core.optimizer.optimizer_config.OptimizerConfig#

Configuration object for Megatron optimizers.
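A minimal construction sketch using a handful of the fields documented below; values are illustrative, not recommendations, and the import path simply mirrors the module name shown on this page.

from core.optimizer.optimizer_config import OptimizerConfig

config = OptimizerConfig(
    lr=3e-4,
    min_lr=3e-5,
    weight_decay=0.01,
    bf16=True,                       # bf16 mixed-precision training
    use_distributed_optimizer=True,  # shard optimizer state over data-parallel replicas
    clip_grad=1.0,
)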

lr: Optional[float]#

None

Initial learning rate. Depending on decay style and initial warmup, the learning rate at each iteration would be different.

min_lr: Optional[float]#

None

Minimum value for the learning rate. The scheduler clips values below this threshold.

decoupled_lr: Optional[float]#

None

Separate learning rate for the input and output layer.

decoupled_min_lr: Optional[float]#

None

Minimum value for the learning rate of the input and output layer. The scheduler clips values below this threshold.

weight_decay: float#

0.01

Weight decay coefficient for L2 regularization.

apply_wd_to_qk_layernorm: bool#

False

If true, apply weight decay to qk layernorm as a special case.

fp8_recipe: Optional[str]#

None

The type of fp8 recipe; it affects the processing logic inside the distributed optimizer.

fp16: bool#

False

If true, train with fp16 mixed precision training. Defaults to False.

bf16: bool#

False

If true, train with bf16 mixed precision training. Defaults to False.

reuse_grad_buf_for_mxfp8_param_ag: bool#

False

If true, reuse the grad buffer for param AG when using mxfp8 recipe. Should be set to True only when fp8_recipe is mxfp8 and fp8_param_gather is True.

params_dtype: torch.dtype#

None

dtype used when initializing the weights. Defaults to torch.float32.

use_precision_aware_optimizer: bool#

False

If true, allows optimizer-related tensors (master_param, gradients and optimizer states) to be set to lower precision. Defaults to False.

store_param_remainders: bool#

True

If true, store only the 16 remainder bits of the FP32 master parameters in the optimizer state (the bits not shared with the BF16 model parameters), instead of full FP32 copies. This lowers GPU memory usage. Defaults to True.
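One way to picture the saving: a BF16 value occupies the high 16 bits of an FP32 value's layout, so an FP32 master parameter can be reconstructed from the BF16 model parameter plus 16 stored "remainder" bits. The snippet below illustrates the idea with a simple bit split; the actual implementation may handle rounding of the BF16 copy differently.

import torch

x32 = torch.tensor([0.1234567], dtype=torch.float32)
bits = x32.view(torch.int32)
hi = (bits >> 16).to(torch.int16)      # high 16 bits: what a (truncated) bf16 copy keeps
lo = (bits & 0xFFFF).to(torch.int16)   # low 16 bits: the stored "remainder"

# Reassembling both halves recovers the exact fp32 value.
rebuilt = ((hi.to(torch.int32) << 16) | (lo.to(torch.int32) & 0xFFFF)).view(torch.float32)
assert torch.equal(rebuilt, x32)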

main_grads_dtype: torch.dtype#

None

dtype of main grads when the precision-aware optimizer is enabled.

main_params_dtype: torch.dtype#

None

dtype of main params when the precision-aware optimizer is enabled.

exp_avg_dtype: torch.dtype#

None

dtype of exp_avg when the precision-aware optimizer is enabled.

exp_avg_sq_dtype: torch.dtype#

None

dtype of exp_avg_sq when the precision-aware optimizer is enabled.

optimizer: str#

‘adam’

Optimizer name (e.g., ‘adam’, ‘sgd’, ‘muon’). Can be overridden per-parameter group via config_overrides to use different optimizers for different parameters.

loss_scale: Optional[float]#

None

Static loss scale; positive power-of-2 values can improve fp16 convergence. If None, dynamic loss scaling is used.

initial_loss_scale: float#

None

Initial loss-scale for dynamic loss scaling.

min_loss_scale: float#

1.0

Minimum loss scale for dynamic loss scaling.

loss_scale_window: float#

1000

Window over which to raise/lower dynamic scale.

hysteresis: int#

2

Hysteresis for dynamic loss scaling.
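For fp16 training these fields combine as follows: leaving loss_scale at None selects dynamic loss scaling governed by initial_loss_scale, min_loss_scale, loss_scale_window, and hysteresis, while setting loss_scale fixes a static scale. A sketch with illustrative values:

# Dynamic loss scaling (loss_scale=None) -- illustrative values, not recommendations.
dynamic_cfg = OptimizerConfig(
    lr=1e-4,
    fp16=True,
    loss_scale=None,           # None => dynamic loss scaling
    initial_loss_scale=2**16,
    min_loss_scale=1.0,
    loss_scale_window=1000,
    hysteresis=2,
)

# Static loss scaling: a fixed positive power of two.
static_cfg = OptimizerConfig(lr=1e-4, fp16=True, loss_scale=4096.0)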

adam_beta1: float#

0.9

First coefficient for computing running averages of gradient and its square in Adam optimizer.

adam_beta2: float#

0.999

Second coefficient for computing running averages of gradient and its square in Adam optimizer.

adam_eps: float#

1e-08

Term added to the denominator to improve numerical stability in Adam optimizer.

decoupled_weight_decay: bool#

True

If true, decouples weight decay from the gradient update, equivalent to AdamW. If false, original Adam update rule will be used. Defaults to True.
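Schematically, decoupled_weight_decay controls where weight decay enters the update. The single-tensor sketch below (bias correction omitted, not Megatron's implementation) shows the difference between folding the decay into the gradient (classic Adam + L2) and applying it directly to the weights (AdamW):

import torch

def adam_step(param, grad, exp_avg, exp_avg_sq,
              lr=1e-3, wd=0.01, beta1=0.9, beta2=0.999, eps=1e-8, decoupled=True):
    """Schematic Adam/AdamW update illustrating decoupled_weight_decay."""
    if not decoupled:
        grad = grad + wd * param                       # classic Adam: L2 folded into the gradient
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)                # first moment
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # second moment
    update = exp_avg / (exp_avg_sq.sqrt() + eps)
    if decoupled:
        update = update + wd * param                   # AdamW: decay applied to the weights
    return param - lr * update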

sgd_momentum: float#

0.9

Momentum factor for SGD optimizer.

muon_momentum: float#

0.95

The momentum used by the internal SGD in Muon optimizer.

muon_split_qkv: bool#

True

Whether to split QKV parameters for Muon optimizer.

muon_nesterov: bool#

False

Whether to use Nesterov-style momentum in the internal SGD.

muon_scale_mode: str#

‘spectral’

The mode to use for the scale factor. Defaults to “spectral”.

muon_fp32_matmul_prec: str#

‘medium’

The precision to use for the fp32 matmul. Defaults to “medium”.

muon_coefficient_type: str#

‘quintic’

Newton-Schulz coefficient type for the Muon optimizer. Valid types are discovered dynamically from the installed emerging_optimizers package. Defaults to “quintic”.

muon_num_ns_steps: int#

5

The number of iteration steps to use in the Newton-Schulz iteration.

muon_tp_mode: str#

‘blockwise’

How to perform NS calculation for tensor parallel weights. Defaults to “blockwise”.

muon_extra_scale_factor: float#

1.0

Additional scale factor for the muon update.

muon_scalar_optimizer: str#

‘adam’

Optimizer for nonlinear parameters (embeddings, biases, norms) when using muon. One of ‘adam’ or ‘lion’. Defaults to ‘adam’.
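Putting the Muon knobs together (a sketch with the documented defaults written out explicitly; values are illustrative):

muon_cfg = OptimizerConfig(
    lr=2e-4,
    optimizer="muon",               # see the optimizer field above
    muon_momentum=0.95,
    muon_nesterov=False,
    muon_num_ns_steps=5,            # Newton-Schulz iteration steps
    muon_scale_mode="spectral",
    muon_scalar_optimizer="adam",   # embeddings/biases/norms fall back to Adam
)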

lion_beta1: float#

0.95

First beta coefficient for Lion optimizer (used in sign update). Defaults to 0.95.

lion_beta2: float#

0.98

Second beta coefficient for Lion optimizer (used in momentum EMA update). Defaults to 0.98.

soap_shampoo_beta: float#

0.95

The beta parameter for the Shampoo preconditioner.

soap_precondition_frequency: int#

1

The frequency of the Shampoo preconditioner.

soap_use_kl_shampoo: bool#

True

Whether to use the KL-Shampoo preconditioner.

adaptive_muon_moment2_method: str#

‘adamuon’

The method to use for the moment2 update in Adaptive Muon optimizer.

adaptive_muon_beta2: float#

0.95

The beta2 parameter for the Adaptive Muon optimizer.

adaptive_muon_eps: float#

1e-08

The eps parameter for the Adaptive Muon optimizer.

use_distributed_optimizer: bool#

False

Distribute optimizer state over data-parallel replicas.

use_layer_wise_distributed_optimizer: bool#

False

Use LayerWiseDistributedOptimizer for emerging optimizers (e.g. Muon). When --use-distributed-optimizer is passed together with an emerging optimizer, the training-arguments layer sets this flag and resets use_distributed_optimizer to False so that the standard distributed-optimizer path is not triggered.
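In other words, for an emerging optimizer the two flags end up in the configuration sketched below; per the description above this is normally produced by the training-arguments layer rather than set by hand:

layerwise_cfg = OptimizerConfig(
    optimizer="muon",
    use_distributed_optimizer=False,            # reset so the standard path is not triggered
    use_layer_wise_distributed_optimizer=True,  # layer-wise path for emerging optimizers
)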

overlap_param_gather: bool#

False

If true, overlap param all-gather with forward compute. This argument is intended to have the same value as the "overlap_param_gather" argument in distributed_data_parallel_config.py. In the optimizer, this argument is only used when reuse_grad_buf_for_mxfp8_param_ag=True and fp8_param_gather=True.

overlap_param_gather_with_optimizer_step: bool#

False

If true, overlap param all-gather of first bucket with optimizer step.

optimizer_cpu_offload: bool#

False

If True, offload optimizer states tensor and compute to CPU.

optimizer_offload_fraction: float#

0.0

Specifies the fraction of optimizer states to offload from GPU memory to CPU.

use_torch_optimizer_for_cpu_offload: bool#

False

If True, use torch.optim.Optimizer for CPU offload.

overlap_cpu_optimizer_d2h_h2d: bool#

False

If True, overlap the CPU optimizer update with host-device data transfers. This reduces idle time during data movement by letting the optimizer update parameters while gradients and parameters are still being transferred between devices, which can improve overall training efficiency.

pin_cpu_grads: bool#

True

If True, pin the optimizer gradients to CPU memory.

pin_cpu_params: bool#

True

If True, pin the optimizer parameters to CPU memory.
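The CPU-offload fields above might be combined as follows (illustrative sketch; fractions and flags are examples only):

offload_cfg = OptimizerConfig(
    lr=1e-4,
    optimizer_cpu_offload=True,          # keep optimizer state and compute on CPU
    optimizer_offload_fraction=0.5,      # offload half of the optimizer state
    overlap_cpu_optimizer_d2h_h2d=True,  # overlap CPU update with D2H/H2D transfers
    pin_cpu_grads=True,
    pin_cpu_params=True,
)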

clip_grad: float#

1.0

Gradient clipping based on global L2 norm.

log_num_zeros_in_grad: bool#

False

If true, calculate and log the number of zeros in gradient.

barrier_with_L1_time: bool#

False

If true, use barrier with level 1 time measurements.

timers: Optional[Callable]#

None

Function to get timers.

config_logger_dir: str#

''

When non-empty, dumps entry-point configs to config_logger_dir.

optimizer_cuda_graph: bool#

False

If true, enables CUDA graph for optimizer step.

__post_init__()#

Check the validity of the config.

core.optimizer.optimizer_config.AdamOptimizerConfig#

None

core.optimizer.optimizer_config.SGDOptimizerConfig#

None