core.optimizer.muon#

Megatron Muon optimizer wrapper that handles tensor parallelism.

Module Contents#

Classes#

TensorParallelMuon

Tensor Parallel Muon optimizer.

Functions#

get_megatron_muon_optimizer

Build a Muon optimizer for the given model chunks.

Data#

API#

core.optimizer.muon.logger#

'getLogger(...)'

class core.optimizer.muon.TensorParallelMuon(
params: torch.optim.optimizer.ParamsT,
lr: float = 0.0003,
momentum_beta: float = 0.95,
use_nesterov: bool = True,
weight_decay: float = 0.01,
use_decoupled_weight_decay: bool = True,
split_qkv: bool = False,
is_qkv_fn: Callable[[torch.Tensor], bool] | None = None,
qkv_split_shapes: tuple[int, int, int] | None = None,
fp32_matmul_prec: str = 'medium',
coefficient_type: str = 'quintic',
num_ns_steps: int = 5,
scale_mode: str = 'spectral',
extra_scale_factor: float = 1.0,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
mode: Literal['blockwise', 'duplicated', 'distributed'] = 'duplicated',
)#

Bases: emerging_optimizers.orthogonalized_optimizers.OrthogonalizedOptimizer

Tensor Parallel Muon optimizer.

Initialization
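
A minimal usage sketch, assuming the standard `torch.optim.Optimizer` step/zero_grad interface inherited from `OrthogonalizedOptimizer`. The tensor shape and hyperparameter values are illustrative, and the default `duplicated` mode is used so no process groups need to be set up:

```python
import torch

from core.optimizer.muon import TensorParallelMuon

# Illustrative 2D weight: Muon-style orthogonalization applies to matrix
# parameters (e.g. linear layers), not to 1D biases or norm parameters.
weight = torch.nn.Parameter(torch.randn(1024, 4096))

opt = TensorParallelMuon(
    [weight],
    lr=3e-4,
    momentum_beta=0.95,
    num_ns_steps=5,
    mode='duplicated',  # default: each rank orthogonalizes the full tensor
)

weight.grad = torch.randn_like(weight)
opt.step()       # momentum update followed by orthogonalization
opt.zero_grad()
```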

orthogonalize(
p: torch.Tensor,
grad: torch.Tensor,
**kwargs: Any,
) → torch.Tensor#

Orthogonalize the momentum.

Parameters:
  • p – The parameter tensor. It is necessary to pass the parameter tensor in addition to the momentum because much of the needed information (attributes, for example) is only available on the parameter tensor.

  • grad – The momentum tensor.

Returns:

The orthogonalized gradient tensor.
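
For intuition, below is a minimal sketch of the quintic Newton-Schulz iteration commonly used for this orthogonalization step in Muon-style optimizers, with coefficients taken from the public Muon reference implementation. The actual kernel used by this class, including its `coefficient_type`, `num_ns_steps`, and precision handling, may differ:

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map g onto the nearest semi-orthogonal matrix."""
    # Quintic coefficients from the public Muon reference implementation,
    # tuned for fast convergence of the singular values toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.bfloat16()
    x = x / (x.norm() + 1e-7)  # spectral norm <= 1 is required for convergence
    transposed = x.size(-2) > x.size(-1)
    if transposed:
        x = x.mT  # iterate on the wide orientation for efficiency
    for _ in range(steps):
        A = x @ x.mT
        B = b * A + c * (A @ A)
        x = a * x + B @ x
    if transposed:
        x = x.mT
    return x.to(g.dtype)
```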

core.optimizer.muon.get_megatron_muon_optimizer(
config: core.optimizer.optimizer_config.OptimizerConfig,
model_chunks: List[megatron.core.transformer.module.MegatronModule],
config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]] = None,
use_gloo_process_groups: bool = True,
layer_wise_distributed_optimizer: bool = False,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
) → core.optimizer.optimizer.MegatronOptimizer#

Build the Muon optimizer for the given model chunks.

Parameters:
  • config (OptimizerConfig) – optimizer configuration object.

  • model_chunks (List[MegatronModule]) – model chunks to get optimizer for.

  • use_gloo_process_groups (bool) – if false, disable use of Gloo process groups in underlying Megatron optimizers.

  • layer_wise_distributed_optimizer (bool) – if true, use layer-wise distributed optimizer. Defaults to False.
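
A hedged end-to-end sketch: the `model_chunks` argument is assumed to come from the usual Megatron model-provider setup (not shown), and the `OptimizerConfig` fields below are illustrative rather than exhaustive:

```python
from typing import List

from core.optimizer.muon import get_megatron_muon_optimizer
from core.optimizer.optimizer_config import OptimizerConfig
from megatron.core.transformer.module import MegatronModule


def build_muon_optimizer(model_chunks: List[MegatronModule]):
    """Wire the Muon optimizer into a Megatron training setup."""
    # Illustrative configuration; see OptimizerConfig for the full field list.
    config = OptimizerConfig(lr=3e-4, weight_decay=0.01)
    return get_megatron_muon_optimizer(
        config=config,
        model_chunks=model_chunks,
        use_gloo_process_groups=True,
        layer_wise_distributed_optimizer=False,
    )
```

The returned `MegatronOptimizer` then drives the usual training-loop calls (`zero_grad`, backward pass, `step`).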