core.optimizer.muon#
Megatron Muon optimizer wrapper that handles tensor parallelism.
Module Contents#
Classes#
TensorParallelMuon – Tensor Parallel Muon optimizer.
Functions#
get_megatron_muon_optimizer – Build the Muon optimizer for a list of model chunks.
Data#
API#
- core.optimizer.muon.logger#
'getLogger(…)'
- class core.optimizer.muon.TensorParallelMuon(
- params: torch.optim.optimizer.ParamsT,
- lr: float = 0.0003,
- momentum_beta: float = 0.95,
- use_nesterov: bool = True,
- weight_decay: float = 0.01,
- use_decoupled_weight_decay: bool = True,
- split_qkv: bool = False,
- is_qkv_fn: Callable[[torch.Tensor], bool] | None = None,
- qkv_split_shapes: tuple[int, int, int] | None = None,
- fp32_matmul_prec: str = 'medium',
- coefficient_type: str = 'quintic',
- num_ns_steps: int = 5,
- scale_mode: str = 'spectral',
- extra_scale_factor: float = 1.0,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
- mode: Literal['blockwise', 'duplicated', 'distributed'] = 'duplicated',
Bases: emerging_optimizers.orthogonalized_optimizers.OrthogonalizedOptimizer

Tensor Parallel Muon optimizer.
Initialization
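A minimal construction sketch, assuming `TensorParallelMuon` is importable from `core.optimizer.muon` as documented above. The model and hyperparameter values are illustrative; the keyword values shown match the defaults in the signature.

```python
# Illustrative sketch: construct TensorParallelMuon for a plain 2D weight.
# The import path follows this module's documentation; values are defaults.
import torch
from core.optimizer.muon import TensorParallelMuon

model = torch.nn.Linear(1024, 1024, bias=False)

optimizer = TensorParallelMuon(
    model.parameters(),
    lr=3e-4,
    momentum_beta=0.95,
    use_nesterov=True,
    weight_decay=0.01,
    num_ns_steps=5,          # Newton-Schulz iterations per update
    scale_mode="spectral",   # spectral update scaling (default)
    mode="duplicated",       # each TP rank orthogonalizes the full matrix
)

loss = model(torch.randn(8, 1024)).square().mean()
loss.backward()
optimizer.step()
```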
- orthogonalize(
- p: torch.Tensor,
- grad: torch.Tensor,
- **kwargs: Any,
Orthogonalize the momentum.
- Parameters:
p – The parameter tensor. The parameter must be passed in addition to the momentum because some required information (for example, tensor-parallel attributes) is available only on the parameter tensor.
grad – The momentum tensor.
- Returns:
The orthogonalized gradient tensor.
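For intuition, below is a minimal sketch of the quintic Newton-Schulz iteration that `orthogonalize` performs when `coefficient_type='quintic'` and `num_ns_steps=5`. The polynomial coefficients are the ones popularized by the original Muon implementation and are an assumption here, not read from this module; the real method additionally handles tensor-parallel sharding and QKV splitting, which this sketch omits.

```python
# Illustrative sketch of quintic Newton-Schulz orthogonalization.
# Coefficients (a, b, c) follow the original Muon implementation (assumption).
import torch

def newton_schulz_quintic(grad: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2D matrix to the nearest semi-orthogonal matrix."""
    assert grad.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic polynomial coefficients
    x = grad / (grad.norm() + 1e-7)     # normalize so the iteration converges
    transposed = x.size(0) > x.size(1)
    if transposed:                       # iterate in the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x
```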
- core.optimizer.muon.get_megatron_muon_optimizer(
- config: core.optimizer.optimizer_config.OptimizerConfig,
- model_chunks: List[megatron.core.transformer.module.MegatronModule],
- config_overrides: Optional[Dict[core.optimizer.optimizer_config.ParamKey, megatron.core.optimizer_param_scheduler.ParamGroupOverride]] = None,
- use_gloo_process_groups: bool = True,
- layer_wise_distributed_optimizer: bool = False,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
Build the Muon optimizer for the given model chunks.
- Parameters:
config (OptimizerConfig) – optimizer configuration object.
model_chunks (List[MegatronModule]) – model chunks to build the optimizer for.
config_overrides (Optional[Dict[ParamKey, ParamGroupOverride]]) – optional per-parameter-group overrides of the base optimizer configuration. Defaults to None.
use_gloo_process_groups (bool) – if False, disable use of Gloo process groups in the underlying Megatron optimizers.
layer_wise_distributed_optimizer (bool) – if True, use a layer-wise distributed optimizer. Defaults to False.
pg_collection (Optional[ProcessGroupCollection]) – optional collection of process groups for the optimizer to use. Defaults to None.
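A minimal call sketch, assuming a Megatron-style training setup in which a populated `OptimizerConfig` and the `model_chunks` list (`List[MegatronModule]`) already exist; both are assumptions supplied by the surrounding training code.

```python
# Illustrative sketch: build the Muon optimizer from an existing config and
# model chunks. `config` and `model_chunks` come from the training setup.
from core.optimizer.muon import get_megatron_muon_optimizer

optimizer = get_megatron_muon_optimizer(
    config=config,                        # core.optimizer.optimizer_config.OptimizerConfig
    model_chunks=model_chunks,            # List[MegatronModule]
    use_gloo_process_groups=True,         # keep Gloo groups in underlying optimizers
    layer_wise_distributed_optimizer=False,
)
```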