core.optimizer#

Subpackages#

Submodules#

Package Contents#

Functions#

get_standard_config_overrides

Get standard config overrides for the optimizer, handling decoupled LR and common weight-decay skips.

get_mup_config_overrides

Get MuP config overrides for per-layer LR and Adam epsilon scaling.

_get_param_groups

Create parameter groups for optimizer.

_get_param_groups_and_buffers

Returns parameter groups and buffers for the optimizer.

_get_megatron_optimizer_based_on_param_groups

Get Megatron optimizer based on parameter groups.

check_config_overrides_consistency

Check if the config overrides are consistent with the config.

_get_megatron_emerging_optimizer

Build an emerging optimizer (e.g. Muon) for the given model chunks.

get_megatron_optimizer

Retrieve the Megatron optimizer for model chunks.

Data#

API#

core.optimizer.HAVE_EMERGING_OPTIMIZERS#

None

core.optimizer.logger#

‘getLogger(…)’

core.optimizer.get_standard_config_overrides(
config: core.optimizer.optimizer_config.OptimizerConfig,
) -> Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]#

Get standard config overrides for the optimizer, handling decoupled LR and common weight-decay skips.

Parameters:

config (OptimizerConfig) – optimizer configuration object.

Returns:

standard config overrides.

Return type:

Dict[ParamKey, ParamGroupOverride]
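A minimal sketch of what the standard overrides typically encode: zero weight decay for bias and length-1 (vector-like) parameters, plus a decoupled LR for embedding/output parameters when one is configured. The dictionary keys and field names below are illustrative stand-ins, not Megatron's actual `ParamKey` or `ParamGroupOverride` objects:

```python
def standard_config_overrides(decoupled_lr=None, decoupled_min_lr=None):
    """Illustrative stand-in for get_standard_config_overrides."""
    overrides = {
        ("bias",): {"wd_mult": 0.0},      # skip weight decay on biases
        ("1d_param",): {"wd_mult": 0.0},  # skip weight decay on norms etc.
    }
    if decoupled_lr is not None:
        # Embedding/output params get their own (decoupled) learning rate.
        ov = {"max_lr": decoupled_lr}
        if decoupled_min_lr is not None:
            ov["min_lr"] = decoupled_min_lr
        overrides[("embedding_or_output",)] = ov
    return overrides
```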

core.optimizer.get_mup_config_overrides(
config: core.optimizer.optimizer_config.OptimizerConfig,
mup_width_mult: float,
optimizer_type: str = 'adam',
) -> Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]#

Get MuP config overrides for per-layer LR and Adam epsilon scaling.

In MuP, optimizer learning rates are adjusted by parameter class to ensure stable update scales across model widths and enable hyperparameter transfer.

MuP optimizer scaling rules (as implemented here):

  • Adam/AdamW:

    • hidden (matrix-like) lr = base_lr / width_mult

    • hidden (matrix-like) eps = base_eps / width_mult

    • vector-like params keep base lr and eps

  • SGD:

    • vector-like lr = base_lr * width_mult

    • hidden (matrix-like) lr keeps base_lr in the current uniform-width setup

    • no eps override is applied

  • Other non-Adam optimizers:

    • hidden (matrix-like) lr = base_lr / width_mult

    • no eps override is applied.

    • for Muon optimizers, matrix-like params managed by Muon itself are excluded from these Adam-style MuP overrides.

With decoupled_lr enabled, embedding/output params continue using decoupled LR and MuP will not override those explicit decoupled values.

Parameters:
  • config (OptimizerConfig) – optimizer configuration object.

  • mup_width_mult (float) – Width multiplier (hidden_size / base_hidden_size).

  • optimizer_type (str) – Optimizer type string from config.optimizer.

Returns:

MuP optimizer overrides.

Return type:

Dict[ParamKey, ParamGroupOverride]
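The Adam/AdamW and SGD scaling rules above reduce to simple arithmetic. The sketch below illustrates that arithmetic in plain Python; the string keys are placeholders for the real `ParamKey` matchers, not Megatron's actual API:

```python
def mup_adam_overrides(base_lr, base_eps, width_mult):
    """Adam/AdamW rule: hidden (matrix-like) params scale lr and eps down
    by the width multiplier; vector-like params keep base values, so no
    override is emitted for them."""
    return {
        "hidden_matrix": {
            "lr": base_lr / width_mult,
            "eps": base_eps / width_mult,
        },
    }

def mup_sgd_overrides(base_lr, width_mult):
    """SGD rule: vector-like params scale lr *up* by the width multiplier;
    hidden params keep base_lr in the uniform-width setup."""
    return {"vector_like": {"lr": base_lr * width_mult}}
```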

core.optimizer._get_param_groups(
model_chunks: List[core.transformer.module.MegatronModule],
config: core.optimizer.optimizer_config.OptimizerConfig,
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]],
) -> List[Dict]#

Create parameter groups for optimizer.

Creates parameter groups from provided optimizer config object.

NOTE: a parameter may be matched by more than one ParamKey. All matching ParamKey overrides are merged into a single ParamGroupOverride for that parameter, which then serves as that parameter's grouping key. Parameters that end up with the same set of merged overrides are mapped into the same parameter group.

Parameters:
  • model_chunks (List[MegatronModule]) – model chunks to create parameter groups for.

  • config (OptimizerConfig) – optimizer configuration object.

  • config_overrides (Optional[Dict[ParamKey, ParamGroupOverride]]) – optimizer overrides, specified on a per-layer basis. NOTE: to skip applying weight decay on bias and length-1 parameters without applying any other overrides, pass an empty dictionary rather than the default value of None.

Returns:

List of parameter groups.
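A toy illustration of the merging rule from the NOTE above: every override whose key matches a parameter is merged into one dict, and parameters with identical merged dicts land in the same group. Matching here is by simple name substring purely for illustration; the real implementation matches against `ParamKey` objects:

```python
def merged_override(param_name, overrides):
    # Merge every override whose key appears in the parameter name.
    merged = {}
    for key, ov in sorted(overrides.items()):
        if key in param_name:
            merged.update(ov)
    return merged

def group_params(param_names, overrides):
    # Parameters with the same merged overrides share a parameter group.
    groups = {}
    for name in param_names:
        key = tuple(sorted(merged_override(name, overrides).items()))
        groups.setdefault(key, []).append(name)
    return groups
```

For example, a parameter named `embedding.bias` matched by both a `bias` override and an `embedding` override receives the union of the two.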

core.optimizer._get_param_groups_and_buffers(
model_chunks: List[core.transformer.module.MegatronModule],
model_chunk_offset: int,
config: core.optimizer.optimizer_config.OptimizerConfig,
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]],
filter_fn: Callable,
buffer_name: str,
) -> Tuple[List[Dict], Dict[int, List[core.distributed.param_and_grad_buffer._ParamAndGradBuffer]]]#

Returns parameter groups and buffers for the optimizer.

Parameters:
  • model_chunks (List[MegatronModule]) – model chunks to create parameter groups for.

  • model_chunk_offset (int) – offset of model_chunks in global model_chunks list.

  • config (OptimizerConfig) – optimizer configuration object.

  • config_overrides (Optional[Dict[ParamKey, ParamGroupOverride]]) – optimizer/scheduler overrides, specified on the basis of ParamKey matches with each parameter.

  • filter_fn (callable) – filtering function for param_groups.

  • buffer_name (str) – name of buffer.

Returns:

List of parameter groups and dictionary of model chunk IDs to buffers.

core.optimizer._get_megatron_optimizer_based_on_param_groups(
config: core.optimizer.optimizer_config.OptimizerConfig,
model_chunks: List[core.transformer.module.MegatronModule],
param_groups: List,
per_model_buffers: Optional[Dict[int, List[core.distributed.param_and_grad_buffer._ParamAndGradBuffer]]] = None,
model_parallel_group: Optional[torch.distributed.ProcessGroup] = None,
data_parallel_group: Optional[torch.distributed.ProcessGroup] = None,
data_parallel_group_gloo: Optional[torch.distributed.ProcessGroup] = None,
data_parallel_group_idx: Optional[int] = None,
intra_dist_opt_group: Optional[torch.distributed.ProcessGroup] = None,
distributed_optimizer_instance_id: Optional[int] = 0,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
skip_megatron_wrapping: bool = False,
) -> Union[core.optimizer.optimizer.MegatronOptimizer, Tuple[Optional[torch.optim.Optimizer], Optional[Callable]]]#

Get Megatron optimizer based on parameter groups.

Parameters:
  • config (OptimizerConfig) – optimizer configuration object.

  • model_chunks (list) – list of model chunks.

  • param_groups (list) – list of parameter groups.

  • per_model_buffers (dict, optional) – buffers for distributed optimizer. Defaults to None.

  • data_parallel_group (torch.distributed.ProcessGroup, optional) – data-parallel group for distributed optimizer. Defaults to None.

  • data_parallel_group_gloo (torch.distributed.ProcessGroup, optional) – gloo data-parallel group for distributed optimizer. Defaults to None.

  • data_parallel_group_idx (int, optional) – data-parallel group index for distributed optimizer. Defaults to None.

  • distributed_optimizer_instance_id (int, optional) – ID of the distributed optimizer instance. Defaults to 0.

  • skip_megatron_wrapping (bool) – if True, return an (optimizer, init_state_fn) tuple containing the raw PyTorch optimizer without any Megatron wrapping. Useful when the caller (e.g. LayerWiseDistributedOptimizer) performs its own wrapping.

Returns:

Instance of MegatronOptimizer, or (optimizer, init_state_fn) when skip_megatron_wrapping=True.

core.optimizer.check_config_overrides_consistency(
config: core.optimizer.optimizer_config.OptimizerConfig,
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]],
)#

Check if the config overrides are consistent with the config.

core.optimizer._get_megatron_emerging_optimizer(
config: core.optimizer.optimizer_config.OptimizerConfig,
model_chunks: List[core.transformer.module.MegatronModule],
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, Any]] = None,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
) -> core.optimizer.optimizer.MegatronOptimizer#

Build an emerging optimizer (e.g. Muon) for the given model chunks.

Parameter separation (e.g., linear weights -> Muon, rest -> Adam) is expressed as a config_override, the same mechanism used for weight-decay and learning-rate overrides. Adam/SGD groups are delegated to _get_megatron_optimizer_based_on_param_groups so they go through the exact same code path as the standard optimizer factory.

When config.use_layer_wise_distributed_optimizer is True, the underlying optimizers are wrapped with LayerWiseDistributedOptimizer.
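A sketch of the parameter-separation idea: matrix-like (2-D) weights go to the emerging optimizer (e.g. Muon) while everything else stays with Adam. In Megatron the split is expressed through config_overrides; this shape-based rule is a simplified stand-in:

```python
def split_for_muon(named_shapes):
    """Partition parameters by shape: 2-D (matrix-like) weights go to the
    Muon-style optimizer, everything else (biases, norms) to Adam.
    Takes (name, shape) pairs; returns (muon_names, adam_names)."""
    muon, adam = [], []
    for name, shape in named_shapes:
        (muon if len(shape) == 2 else adam).append(name)
    return muon, adam
```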

core.optimizer.get_megatron_optimizer(
config: core.optimizer.optimizer_config.OptimizerConfig,
model_chunks: List[core.transformer.module.MegatronModule],
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]] = None,
use_gloo_process_groups: bool = True,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
dump_param_to_param_group_map: Optional[str] = None,
) -> core.optimizer.optimizer.MegatronOptimizer#

Retrieve the Megatron optimizer for model chunks.

Handles both standard optimizers (Adam, SGD) and emerging optimizers (e.g. Muon). We use separate optimizers for expert parameters and non-expert parameters. For emerging optimizers with config.use_layer_wise_distributed_optimizer=True, the optimizer is automatically wrapped with LayerWiseDistributedOptimizer.

Parameters:
  • config (OptimizerConfig) – optimizer configuration object.

  • model_chunks (List[MegatronModule]) – model chunks to get optimizer for.

  • config_overrides (Optional[Dict[ParamKey, ParamGroupOverride]]) – optional dictionary of parameter-group overrides to adjust default optimizer behavior for different subsets of parameters (identified by ParamKey).

  • use_gloo_process_groups (bool) – if False, disable use of Gloo process groups in the underlying Megatron optimizers.

  • pg_collection (Optional[ProcessGroupCollection]) – optional collection of process groups for distributed training.

  • dump_param_to_param_group_map (Optional[str]) – path to dump parameter to param group map.

Returns:

Instance of MegatronOptimizer.
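The docstring notes that expert and non-expert (dense) parameters get separate optimizers. A minimal sketch of that partition, using a hypothetical `is_expert` flag in place of Megatron's real per-parameter attribute checks:

```python
def partition_params(params):
    """Split parameter records into expert (MoE) and dense groups, each of
    which would be handed to its own optimizer instance."""
    expert = [p for p in params if p.get("is_expert", False)]
    dense = [p for p in params if not p.get("is_expert", False)]
    return expert, dense
```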