# Optimizer and Scheduler Configuration

The optimizer and scheduler configurations control optimization algorithms, learning rate schedules, and weight decay strategies.

## OptimizerConfig (from Megatron Core)

The `OptimizerConfig` contains all parameters for the optimization algorithm and comes directly from Megatron Core. Key parameters include:

| Parameter | Type | Description |
|---|---|---|
| `optimizer` | `str` | Optimizer type (`"adam"`, `"sgd"`, etc.) |
| `lr` | `float` | Base learning rate |
| `min_lr` | `float` | Minimum learning rate for decay schedules |
| `weight_decay` | `float` | L2 regularization coefficient |
| `adam_beta1` | `float` | Adam optimizer beta1 parameter |
| `adam_beta2` | `float` | Adam optimizer beta2 parameter |
| `adam_eps` | `float` | Adam optimizer epsilon parameter |
| `clip_grad` | `float` | Gradient clipping threshold |
| `use_distributed_optimizer` | `bool` | Enable the distributed optimizer for memory efficiency |
| `overlap_grad_reduce` | `bool` | Overlap gradient reduction with computation |
| `overlap_param_gather` | `bool` | Overlap parameter gathering with computation |
| `bf16` | `bool` | Use BF16 precision for training |
| `fp16` | `bool` | Use FP16 precision for training |
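
For example, a mixed-precision Adam setup might look like the following. This is a minimal sketch: the import path matches recent Megatron Core releases, and all values are illustrative rather than recommended.

```python
from megatron.core.optimizer import OptimizerConfig

# Illustrative Adam configuration; tune values for your model and cluster.
optimizer_config = OptimizerConfig(
    optimizer="adam",
    lr=3e-4,                         # base learning rate
    min_lr=3e-5,                     # floor for decay schedules
    weight_decay=0.1,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_eps=1e-8,
    clip_grad=1.0,                   # clip gradient norm at 1.0
    use_distributed_optimizer=True,  # shard optimizer state across data-parallel ranks
    overlap_grad_reduce=True,
    overlap_param_gather=True,
    bf16=True,                       # train in BF16; leave fp16 unset
)
```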

## SchedulerConfig

The `SchedulerConfig` controls learning rate scheduling and weight decay progression throughout training.

### Learning Rate Scheduling

| Parameter | Type | Default | Description |
|---|---|---|---|
| `lr_decay_style` | `Literal["constant", "linear", "cosine", "inverse-square-root", "WSD"]` | `"linear"` | Learning rate decay function |
| `lr_decay_iters` | `Optional[int]` | `None` | Iterations to decay the LR over (defaults to `train_iters`) |
| `lr_warmup_iters` | `int` | `0` | Iterations to linearly warm up the learning rate |
| `lr_warmup_fraction` | `Optional[float]` | `None` | Fraction of decay iterations to use for warmup |
| `lr_warmup_init` | `float` | `0.0` | Initial learning rate for the warmup phase |
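
As a sketch, a cosine schedule with a linear warmup could be configured as below. The `SchedulerConfig` import path is an assumption (it depends on the framework wrapping Megatron Core); the field names match the table above.

```python
from megatron.bridge.training.config import SchedulerConfig  # import path assumed

scheduler_config = SchedulerConfig(
    lr_decay_style="cosine",
    lr_decay_iters=None,   # None: decay over the full train_iters
    lr_warmup_iters=2000,  # 2,000 iterations of linear warmup
    lr_warmup_init=0.0,    # warmup starts from zero
)
```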

### WSD (Warmup-Stable-Decay) Scheduling

| Parameter | Type | Default | Description |
|---|---|---|---|
| `lr_wsd_decay_style` | `Literal["exponential", "linear", "cosine"]` | `"exponential"` | Decay style for the WSD annealing phase |
| `lr_wsd_decay_iters` | `Optional[int]` | `None` | Iterations for the WSD annealing phase |
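
Reusing the `SchedulerConfig` sketch above, a WSD schedule with a linear annealing phase might be configured like this (values are illustrative):

```python
scheduler_config = SchedulerConfig(
    lr_decay_style="WSD",
    lr_warmup_iters=1000,         # warmup phase length
    lr_wsd_decay_style="linear",  # annealing style for the final phase
    lr_wsd_decay_iters=5000,      # annealing phase length
)
```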

### Weight Decay Scheduling

Parameters for controlling the progression of weight decay during training, including start and end values and the scheduling strategy:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `start_weight_decay` | `Optional[float]` | `None` | Initial weight decay coefficient |
| `end_weight_decay` | `Optional[float]` | `None` | Final weight decay coefficient |
| `weight_decay_incr_style` | `Literal["constant", "linear", "cosine"]` | `"constant"` | Weight decay progression style |
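
For instance, to ramp weight decay upward over training with a cosine curve (a sketch using the same assumed `SchedulerConfig`):

```python
scheduler_config = SchedulerConfig(
    start_weight_decay=0.01,           # weight decay at the start of training
    end_weight_decay=0.1,              # weight decay at the end of training
    weight_decay_incr_style="cosine",  # cosine ramp between the two
)
```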

### Checkpoint Integration

Parameters for managing how scheduler settings are applied during checkpoint loading, allowing control over whether to prioritize config values or restore from saved state:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `override_opt_param_scheduler` | `bool` | `False` | Reset scheduler values from the config, ignoring the checkpoint |
| `use_checkpoint_opt_param_scheduler` | `bool` | `False` | Use scheduler values from the checkpoint, ignoring the config |
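
The two flags select opposite priorities, so only one should be set at a time. A sketch, again assuming the `SchedulerConfig` from above:

```python
# Resume with the schedule stored in the checkpoint, ignoring config values:
resume_config = SchedulerConfig(use_checkpoint_opt_param_scheduler=True)

# Restart the schedule from the config, discarding the checkpointed state:
fresh_config = SchedulerConfig(override_opt_param_scheduler=True)
```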

### Computed Fields

These fields are automatically calculated during configuration validation and help align training schedules with the configured batch size and iteration counts:

| Field | Description |
|---|---|
| `lr_warmup_steps` | Total steps for warmup (calculated from iterations and batch size) |
| `lr_decay_steps` | Total steps for decay (calculated from iterations and batch size) |
| `wd_incr_steps` | Total steps for weight decay progression |
| `wsd_decay_steps` | Total steps for the WSD annealing phase |
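
The arithmetic can be sketched as follows, assuming the Megatron-LM convention that the scheduler counts progress in samples, so each iteration count is scaled by the global batch size. The numbers are hypothetical.

```python
# Hypothetical values for illustration.
global_batch_size = 512
train_iters = 100_000
lr_warmup_iters = 2_000

lr_warmup_steps = lr_warmup_iters * global_batch_size  # 1,024,000 samples
lr_decay_steps = train_iters * global_batch_size       # lr_decay_iters defaults to train_iters
wd_incr_steps = train_iters * global_batch_size        # WD progresses over all of training
```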

## Learning Rate Schedules

The following scheduling strategies define how the learning rate evolves during training, each suited to different convergence behaviors and model types:

| Schedule Type | Description |
|---|---|
| Constant | Learning rate remains fixed throughout training. |
| Linear | Learning rate decreases linearly from the base LR to the minimum LR. |
| Cosine | Learning rate follows a cosine decay curve from the base LR to the minimum LR. |
| Inverse Square Root | Learning rate decays proportionally to the inverse square root of the step. |
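
The curves can be summarized in a small standalone function. This is a simplified sketch of the math only, with warmup omitted; it is not the framework's implementation (in particular, the inverse-square-root form here drops the warmup-based scaling Megatron-LM applies).

```python
import math

def decayed_lr(step: int, total_steps: int, lr: float, min_lr: float, style: str) -> float:
    """Learning rate at `step` for each decay style (warmup omitted)."""
    frac = min(step / total_steps, 1.0)
    if style == "constant":
        return lr
    if style == "linear":
        return min_lr + (lr - min_lr) * (1.0 - frac)
    if style == "cosine":
        return min_lr + (lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * frac))
    if style == "inverse-square-root":
        return max(min_lr, lr / math.sqrt(max(step, 1)))
    raise ValueError(f"unknown decay style: {style}")
```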

### WSD (Warmup-Stable-Decay)

The WSD schedule divides learning rate progression into three distinct phases, offering fine-grained control over early ramp-up, mid-training stability, and final decay:

| Phase | Description |
|---|---|
| Warmup | Learning rate increases linearly from its initial value to the base LR. |
| Stable | Learning rate remains constant at the base LR. |
| Decay | Learning rate decays to the minimum LR using the configured style (exponential, linear, or cosine). |
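
A piecewise sketch of the three phases (illustrative only; the exponential branch uses one plausible geometric interpolation from the base LR down to the minimum LR):

```python
import math

def wsd_lr(step: int, warmup: int, stable: int, decay: int,
           lr: float, min_lr: float, decay_style: str = "exponential") -> float:
    """WSD learning rate at `step`: warmup -> stable -> decay."""
    if step < warmup:                  # phase 1: linear ramp to base LR
        return lr * step / max(warmup, 1)
    if step < warmup + stable:         # phase 2: hold at base LR
        return lr
    frac = min((step - warmup - stable) / max(decay, 1), 1.0)
    if decay_style == "linear":        # phase 3: anneal to min_lr
        return min_lr + (lr - min_lr) * (1.0 - frac)
    if decay_style == "cosine":
        return min_lr + (lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * frac))
    return lr * (min_lr / lr) ** frac  # "exponential": geometric interpolation
```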

## Weight Decay Scheduling

These scheduling options control how the weight decay coefficient changes over time, allowing for regularization strategies that adapt to different training phases:

| Schedule Type | Description |
|---|---|
| Constant | Fixed weight decay throughout training. |
| Linear | Linear progression from the start to the end weight decay. |
| Cosine | Cosine progression from the start to the end weight decay. |
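
As with the learning rate, the progression is easy to sketch as a standalone function (illustrative only; the constant style simply holds a single coefficient):

```python
import math

def weight_decay_at(step: int, total_steps: int,
                    start_wd: float, end_wd: float, style: str = "constant") -> float:
    """Weight decay coefficient at `step` for each progression style."""
    frac = min(step / total_steps, 1.0)
    if style == "constant":
        return end_wd  # fixed throughout; start_wd is expected to equal end_wd
    if style == "linear":
        return start_wd + (end_wd - start_wd) * frac
    if style == "cosine":
        return start_wd + (end_wd - start_wd) * 0.5 * (1.0 - math.cos(math.pi * frac))
    raise ValueError(f"unknown weight decay style: {style}")
```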