nemo_automodel.components.optim.utils#

Module Contents#

Functions#

  • is_dion_optimizer

  • _separate_param_groups – Separate model parameters into groups for Dion/Muon optimizers.

  • _get_dion_mesh

  • build_dion_optimizer – Build a Dion-family optimizer with parameter grouping.

Data#

API#

nemo_automodel.components.optim.utils._import_error: Exception | None#

None

nemo_automodel.components.optim.utils.logger#

'getLogger(…)'

nemo_automodel.components.optim.utils.is_dion_optimizer(cfg_opt) → bool#
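
`is_dion_optimizer` takes an optimizer config node and returns a bool, presumably True when the config targets a Dion/Muon-family optimizer. A minimal dispatch sketch under that assumption; `instantiate_default_optimizer` is a hypothetical fallback, not part of this module:

```python
from nemo_automodel.components.optim.utils import (
    build_dion_optimizer,
    is_dion_optimizer,
)


def make_optimizer(cfg_opt, model, distributed_mesh=None):
    # Route Dion/Muon configs to the dedicated builder documented below;
    # anything else goes to a generic factory (hypothetical, illustration only).
    if is_dion_optimizer(cfg_opt):
        return build_dion_optimizer(cfg_opt, model, distributed_mesh)
    return instantiate_default_optimizer(cfg_opt, model)
```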
nemo_automodel.components.optim.utils._separate_param_groups(
model: torch.nn.Module,
base_lr: float,
scalar_opt: str,
weight_decay: float,
scalar_betas: tuple[float, float] | None = None,
scalar_eps: float | None = None,
scalar_lr: float | None = None,
embed_lr: float | None = None,
lm_head_lr: float | None = None,
)#

Separate model parameters into groups for Dion/Muon optimizers.

Parameters:
  • model – The model to optimize.

  • base_lr – Base learning rate for matrix params (Muon algorithm).

  • scalar_opt – Optimizer algorithm for scalar params ("adamw" or "lion").

  • weight_decay – Weight decay for vector params.

  • scalar_betas – (beta1, beta2) for scalar optimizer.

  • scalar_eps – Epsilon for scalar optimizer.

  • scalar_lr – Learning rate for scalar (vector/bias) params. Defaults to base_lr.

  • embed_lr – Learning rate for embedding params. Defaults to scalar_lr or base_lr.

  • lm_head_lr – Learning rate for lm_head. Defaults to base_lr / sqrt(d_in).
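
For intuition, here is a self-contained sketch of this kind of grouping; the selection rules (name substrings, `ndim`) and group keys are assumptions drawn from the parameter descriptions above, not the actual implementation:

```python
import math

import torch.nn as nn


def sketch_param_groups(model: nn.Module, base_lr: float, scalar_lr: float | None = None):
    """Illustrative only: split params into matrix / embedding / lm_head / scalar groups."""
    scalar_lr = scalar_lr if scalar_lr is not None else base_lr
    matrix, embed, lm_head, scalar = [], [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "lm_head" in name:
            lm_head.append(p)
        elif "embed" in name:
            embed.append(p)
        elif p.ndim >= 2:
            matrix.append(p)  # 2-D weights get the Muon/Dion matrix update
        else:
            scalar.append(p)  # biases/norm scales go to the scalar optimizer
    groups = [
        {"params": matrix, "lr": base_lr},
        {"params": embed, "lr": scalar_lr},
        {"params": scalar, "lr": scalar_lr},
    ]
    if lm_head:
        # Default lm_head learning rate scales as base_lr / sqrt(d_in),
        # matching the lm_head_lr default documented above.
        d_in = lm_head[0].shape[-1]
        groups.append({"params": lm_head, "lr": base_lr / math.sqrt(d_in)})
    return groups
```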

nemo_automodel.components.optim.utils._get_dion_mesh(distributed_mesh: Any) → Any#
nemo_automodel.components.optim.utils.build_dion_optimizer(
cfg_opt,
model: torch.nn.Module,
distributed_mesh: Optional[Any] = None,
) → Any#

Build a Dion-family optimizer with parameter grouping.

Parameters:
  • cfg_opt – ConfigNode for the optimizer.

  • model – Model whose parameters are to be optimized.

  • distributed_mesh – Optional DeviceMesh for FSDP/TP.
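
A hedged usage sketch; `FakeCfg` only stands in for a real ConfigNode, and its fields (`_target_`, `lr`, `weight_decay`) are assumptions about what such a config carries:

```python
import torch.nn as nn

from nemo_automodel.components.optim.utils import build_dion_optimizer

model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000))


class FakeCfg:
    # Hypothetical config fields; real ConfigNode objects come from the
    # nemo_automodel config system and may differ.
    _target_ = "dion.Muon"
    lr = 1e-3
    weight_decay = 0.01


# Single-process use: no DeviceMesh, so pass distributed_mesh=None.
optimizer = build_dion_optimizer(FakeCfg(), model, distributed_mesh=None)
```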