Optimizer and Scheduler Configuration#
The optimizer and scheduler configurations control optimization algorithms, learning rate schedules, and weight decay strategies.
OptimizerConfig (from Megatron Core)#
The OptimizerConfig contains all parameters for the optimization algorithm and comes directly from Megatron Core. Key parameters include:
Parameter | Type | Description
---|---|---
`optimizer` | str | Optimizer type (`"adam"`, `"sgd"`, etc.)
`lr` | float | Base learning rate
`min_lr` | float | Minimum learning rate for decay schedules
`weight_decay` | float | L2 regularization coefficient
`adam_beta1` | float | Adam optimizer beta1 parameter
`adam_beta2` | float | Adam optimizer beta2 parameter
`adam_eps` | float | Adam optimizer epsilon parameter
`clip_grad` | float | Gradient clipping threshold
`use_distributed_optimizer` | bool | Enable distributed optimizer for memory efficiency
`overlap_grad_reduce` | bool | Overlap gradient reduction with computation
`overlap_param_gather` | bool | Overlap parameter gathering with computation
`bf16` | bool | Use BF16 precision for training
`fp16` | bool | Use FP16 precision for training
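A minimal configuration sketch, assuming Megatron Core is installed and importable from `megatron.core.optimizer`; the values shown are illustrative, not recommendations:

```python
from megatron.core.optimizer import OptimizerConfig

# Field names follow the table above; values are illustrative only.
opt_cfg = OptimizerConfig(
    optimizer="adam",
    lr=3e-4,
    min_lr=3e-5,
    weight_decay=0.1,
    adam_beta1=0.9,
    adam_beta2=0.95,
    clip_grad=1.0,
    use_distributed_optimizer=True,  # shard optimizer state for memory efficiency
    bf16=True,                       # train in BF16 precision
)
```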
SchedulerConfig#
The SchedulerConfig controls learning rate scheduling and weight decay progression throughout training.
Learning Rate Scheduling#
Parameter | Type | Default | Description
---|---|---|---
`lr_decay_style` | str | `"linear"` | Learning rate decay function
`lr_decay_iters` | Optional[int] | `None` | Iterations to decay LR over (defaults to `train_iters`)
`lr_warmup_iters` | int | `0` | Iterations to linearly warmup learning rate
`lr_warmup_fraction` | Optional[float] | `None` | Fraction of decay iterations to use for warmup
`lr_warmup_init` | float | `0.0` | Initial learning rate for warmup phase
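The warmup behavior implied by `lr_warmup_iters` and `lr_warmup_init` can be sketched as a small function; this is an illustrative implementation, not Megatron's exact scheduler code:

```python
def warmup_lr(step: int, base_lr: float, lr_warmup_init: float,
              lr_warmup_iters: int) -> float:
    """Linearly ramp the LR from lr_warmup_init to base_lr over lr_warmup_iters."""
    if lr_warmup_iters <= 0 or step >= lr_warmup_iters:
        return base_lr
    return lr_warmup_init + (base_lr - lr_warmup_init) * step / lr_warmup_iters

# Halfway through a 100-iteration warmup from 0 to 1e-3, the LR is 5e-4.
print(warmup_lr(50, 1e-3, 0.0, 100))
```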
WSD (Warmup-Stable-Decay) Scheduling#
Parameter | Type | Default | Description
---|---|---|---
`lr_wsd_decay_style` | str | `"exponential"` | Decay style for WSD annealing phase
`lr_wsd_decay_iters` | Optional[int] | `None` | Iterations for WSD annealing phase
Weight Decay Scheduling#
Parameters for controlling the progression of weight decay during training, including start and end values and the scheduling strategy:
Parameter | Type | Default | Description
---|---|---|---
`start_weight_decay` | Optional[float] | `None` | Initial weight decay coefficient
`end_weight_decay` | Optional[float] | `None` | Final weight decay coefficient
`weight_decay_incr_style` | str | `"constant"` | Weight decay progression style
Checkpoint Integration#
Parameters for managing how scheduler settings are applied during checkpoint loading, allowing control over whether to prioritize config values or restore from saved state:
Parameter | Type | Default | Description
---|---|---|---
`override_opt_param_scheduler` | bool | `False` | Reset scheduler values from config, ignoring checkpoint
`use_checkpoint_opt_param_scheduler` | bool | `False` | Use scheduler values from checkpoint, ignoring config
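The precedence between the two flags can be sketched as follows; `resolve_scheduler_value` is a hypothetical helper for illustration, not part of the API, and in practice the two flags should not both be set:

```python
def resolve_scheduler_value(config_value, checkpoint_value,
                            override_opt_param_scheduler: bool = False,
                            use_checkpoint_opt_param_scheduler: bool = False):
    """Decide whether a scheduler setting comes from the config or the checkpoint."""
    if override_opt_param_scheduler:
        return config_value       # config wins; saved scheduler state is ignored
    if use_checkpoint_opt_param_scheduler and checkpoint_value is not None:
        return checkpoint_value   # checkpoint wins; config value is ignored
    return config_value           # default: fall back to the config

# Resuming with use_checkpoint_opt_param_scheduler=True keeps the saved value:
print(resolve_scheduler_value(1e-4, 3e-4, use_checkpoint_opt_param_scheduler=True))
```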
Computed Fields#
These fields are automatically calculated during configuration validation and help align training schedules with the configured batch size and iteration counts:
Field | Description
---|---
`lr_warmup_steps` | Total steps for warmup (calculated from iterations and batch size)
`lr_decay_steps` | Total steps for decay (calculated from iterations and batch size)
`wd_incr_steps` | Total steps for weight decay progression
`wsd_decay_steps` | Total steps for WSD annealing phase
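A plausible sketch of the iteration-to-step conversion, assuming steps are counted in consumed samples (iterations times global batch size), as in Megatron's optimizer-parameter scheduler; the function name is illustrative:

```python
def to_sample_steps(iters: int, global_batch_size: int) -> int:
    """Convert an iteration count into a sample-based step count."""
    return iters * global_batch_size

# 2,000 warmup iterations at a global batch size of 512:
print(to_sample_steps(2_000, 512))  # 1024000
```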
Learning Rate Schedules#
The following scheduling strategies define how the learning rate evolves during training, each suited to different convergence behaviors and model types:
Schedule Type | Description
---|---
Constant | Learning rate remains fixed throughout training.
Linear | Learning rate decreases linearly from the base LR to the minimum LR.
Cosine | Learning rate follows a cosine decay curve from base LR to minimum LR.
Inverse Square Root | Learning rate decays proportionally to the inverse square root of the step.
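The four styles above can be sketched as a single function. This is an illustrative implementation, not Megatron's exact code (its inverse-square-root variant, for instance, also accounts for warmup iterations):

```python
import math

def decayed_lr(step: int, decay_iters: int, base_lr: float, min_lr: float,
               style: str) -> float:
    """LR at `step` of a `decay_iters`-long decay phase, per the styles above."""
    frac = min(step, decay_iters) / decay_iters  # fraction of decay completed
    if style == "constant":
        return base_lr
    if style == "linear":
        return min_lr + (base_lr - min_lr) * (1.0 - frac)
    if style == "cosine":
        return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * frac))
    if style == "inverse-square-root":
        return max(min_lr, base_lr / math.sqrt(max(step, 1)))
    raise ValueError(f"unknown decay style: {style}")
```

For example, cosine decay starts at the base LR, passes roughly the midpoint of the LR range halfway through, and lands on the minimum LR at the end of the decay phase.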
WSD (Warmup-Stable-Decay)#
The WSD schedule divides learning rate progression into three distinct phases, offering fine-grained control over early ramp-up, mid-training stability, and final decay:
Phase | Description
---|---
Warmup | Learning rate increases linearly from initial value to base LR.
Stable | Learning rate remains constant at base LR.
Decay | Learning rate decays to minimum LR using a specified style (e.g., exponential, linear, cosine).
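The three phases above can be sketched as a piecewise function; this is a simplified illustration covering linear and cosine annealing, not Megatron's exact implementation:

```python
import math

def wsd_lr(step: int, base_lr: float, min_lr: float, warmup_iters: int,
           stable_iters: int, decay_iters: int,
           decay_style: str = "linear") -> float:
    """Warmup-Stable-Decay: linear ramp, constant plateau, then anneal to min_lr."""
    if step < warmup_iters:
        return base_lr * step / warmup_iters           # warmup phase
    if step < warmup_iters + stable_iters:
        return base_lr                                 # stable phase
    frac = min(step - warmup_iters - stable_iters, decay_iters) / decay_iters
    if decay_style == "linear":
        return min_lr + (base_lr - min_lr) * (1.0 - frac)
    if decay_style == "cosine":
        return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * frac))
    raise ValueError(f"unsupported decay style: {decay_style}")
```

With 10 warmup, 80 stable, and 10 decay iterations, the LR climbs to `base_lr` by step 10, holds flat through step 89, and anneals to `min_lr` by step 100.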
Weight Decay Scheduling#
These scheduling options control how the weight decay coefficient changes over time, allowing for regularization strategies that adapt to different training phases:
Schedule Type | Description
---|---
Constant | Fixed weight decay throughout training.
Linear | Linear progression from start to end weight decay.
Cosine | Cosine progression from start to end weight decay.
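The three progressions above can be sketched as a function interpolating from the start to the end coefficient; this is an illustrative implementation, not the exact Megatron code:

```python
import math

def weight_decay_at(step: int, total_steps: int, start_wd: float, end_wd: float,
                    style: str = "constant") -> float:
    """Weight decay coefficient at `step`, per the styles above."""
    if style == "constant":
        return start_wd
    frac = min(step, total_steps) / total_steps  # fraction of schedule completed
    if style == "linear":
        return start_wd + (end_wd - start_wd) * frac
    if style == "cosine":
        # half-cosine ramp: start_wd at step 0, end_wd at total_steps
        return start_wd + (end_wd - start_wd) * 0.5 * (1.0 - math.cos(math.pi * frac))
    raise ValueError(f"unknown weight decay style: {style}")
```

For example, a linear schedule from 0.01 to 0.1 sits at 0.055 halfway through training.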