nemo_automodel.components.optim.optimizer
Typed optimizer + LR scheduler configs (TorchTitan-style).
Each optimizer config is a plain dataclass exposing the full parameter surface
as named fields (no opaque **kwargs). Reading the dataclass tells you
exactly what you can configure.
Every config owns its own construction via config.build(model, ...), which
loops over model.parts and applies the shared per-part concern (TP foreach).
Megatron-FSDP optimizer sharding is applied separately by the recipe layer.
Subclasses only implement the small
_build_optimizer(params) hook; configs with bespoke construction needs
(e.g. :class:MuonConfig’s Dion parameter grouping) override build directly.
:func:build_optimizer is a thin dispatcher: it normalizes its
optimizer_config argument to an :class:OptimizerConfig and returns
config.build(model, ...). The argument is either:
- a typed :class:OptimizerConfig instance, the Automodel-native path; or
- a (name_or_path, kwargs) tuple, where name_or_path is a short registry name ("adam", "adamw", "muon", …) or a dotted import path ("torch.optim.AdamW"). It is resolved and constructed with kwargs: a target that resolves to an :class:OptimizerConfig subclass is built from its typed fields; any other callable goes through :class:OptimizerFromFactoryConfig, the escape hatch for external integrations (e.g. veRL).
Module Contents
Classes
Functions
Data
API
Bases: _DionConfigBase
dion.Dion2 — recommended successor to the legacy Dion optimizer.
Bases: _DionConfigBase
dion.Dion — legacy low-rank optimizer (prefer :class:Dion2Config).
Legacy Dion takes separate replicate/outer/inner shard meshes; for FSDP2 the
resolved 1-D shard submesh maps to outer_shard_mesh.
Bases: OptimizerConfig
transformer_engine.pytorch.optimizers.FusedAdam.
LR scheduler configuration. None fields are computed by
:meth:build from the training schedule (total steps, optimizer base LR/WD).
Build one LR scheduler per optimizer.
None fields are filled from the training schedule and each
optimizer’s base LR/WD.
Parameters:
The optimizer(s) to schedule.
The step scheduler, used to derive total steps.
Returns: list[OptimizerParamScheduler]
One :class:OptimizerParamScheduler per optimizer.
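The "None fields are filled" rule can be illustrated with a small standalone sketch. Everything here is hypothetical: the `SchedulerSketch` class, its field names (`lr_warmup_steps`, `lr_decay_steps`, `max_lr`) and the resolution logic are assumptions chosen for illustration, not the actual scheduler config.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class SchedulerSketch:
    """Hypothetical stand-in for an LR scheduler config (field names are assumptions)."""

    lr_warmup_steps: int = 0
    lr_decay_steps: Optional[int] = None  # None -> derived from the total step count
    max_lr: Optional[float] = None  # None -> taken from the optimizer's base LR

    def resolve(self, total_steps: int, base_lr: float) -> "SchedulerSketch":
        """Return a copy in which every None field has been filled in."""
        return replace(
            self,
            lr_decay_steps=total_steps if self.lr_decay_steps is None else self.lr_decay_steps,
            max_lr=base_lr if self.max_lr is None else self.max_lr,
        )


def resolve_per_optimizer(
    cfg: SchedulerSketch, base_lrs: List[float], total_steps: int
) -> List[SchedulerSketch]:
    """Produce one resolved config per optimizer, mirroring 'one scheduler per optimizer'."""
    return [cfg.resolve(total_steps, lr) for lr in base_lrs]
```

Explicitly set fields are left alone; only the None fields pick up values from the schedule and from each optimizer.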
Bases: _DionConfigBase
dion.Muon — matrix-aware update for 2D+ params, scalar fallback for 1D.
Bases: _DionConfigBase
dion.NorMuon — Muon variant with neuron-wise normalization.
Base optimizer config.
Subclasses expose their full field surface and implement
:meth:_build_optimizer, the per-part hook that constructs a single
optimizer from a list of parameters. :meth:build owns the shared
orchestration (per-part loop, TP foreach) and is rarely overridden —
only by configs whose construction does not fit the
parameters -> optimizer shape (e.g. :class:MuonConfig). Megatron-FSDP
optimizer sharding is no longer applied here; the recipe layer re-applies it
via shard_optimizers_for_megatron_fsdp(...).
Construct a single optimizer for params (one model part).
Build one optimizer per model.parts (or [model]).
Applies the shared per-part concern (TP foreach disabling) and
delegates the actual optimizer instantiation to :meth:_build_optimizer.
Megatron-FSDP optimizer sharding is applied by the recipe layer, not here.
Parameters:
Model (or model with .parts) to optimize.
Device mesh used for tensor/data parallelism.
Whether the model is being trained with PEFT (suppresses the bf16 torch-Adam precision warning).
Returns: list[torch.optim.Optimizer]
One optimizer per model part.
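The split between the shared build loop and the small per-part hook can be sketched without torch. This is a minimal illustration under stated assumptions: `ConfigSketch`, `SGDConfigSketch`, `SGDStub`, `PartStub` and `PipelineStub` are hypothetical names, and the real build also takes a device mesh and handles TP foreach, which this sketch omits.

```python
from dataclasses import dataclass
from typing import Any, List


class SGDStub:
    """Toy optimizer stand-in so the sketch runs without torch."""

    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr


@dataclass
class ConfigSketch:
    """Hypothetical base: shared orchestration in build(), one small hook."""

    lr: float = 1e-3

    def _build_optimizer(self, params: List[Any]):
        raise NotImplementedError

    def build(self, model) -> list:
        # A model without .parts is treated as a single part.
        parts = getattr(model, "parts", None) or [model]
        return [self._build_optimizer(list(part.parameters())) for part in parts]


@dataclass
class SGDConfigSketch(ConfigSketch):
    """Subclass implements only the hook."""

    def _build_optimizer(self, params):
        return SGDStub(params, lr=self.lr)


class PartStub:
    """Toy model part exposing parameters(), like a torch module would."""

    def __init__(self, n):
        self._params = [object() for _ in range(n)]

    def parameters(self):
        return iter(self._params)


class PipelineStub:
    """Toy multi-part model exposing .parts."""

    def __init__(self, parts):
        self.parts = parts
```

Calling `SGDConfigSketch(lr=0.1).build(PipelineStub([PartStub(2), PartStub(3)]))` returns two optimizers, one per part; passing a bare `PartStub` returns a one-element list.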
Build one optimizer from caller-defined parameter groups.
Bases: OptimizerConfig
Build an optimizer from an arbitrary factory callable plus kwargs.
The integration escape hatch (e.g. veRL): rather than exposing typed fields,
it wraps an optimizer class/callable and the **kwargs to construct it.
This keeps the factory path on the same config.build(model, ...) contract
as the typed configs, so :func:build_optimizer never has to special-case it.
Hyperparameters live in :attr:kwargs; the inherited lr/weight_decay
fields are unused. The factory is called as factory(params=..., **kwargs);
Dion-family optimizers (which need parameter grouping) should use the typed
:class:MuonConfig instead.
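The factory(params=..., **kwargs) calling convention can be sketched as follows. `FactoryConfigSketch` and `external_optimizer` are hypothetical stand-ins, not the library's classes; the sketch only shows that the hyperparameters travel through kwargs rather than typed fields.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class FactoryConfigSketch:
    """Hypothetical wrapper: holds a callable plus the kwargs to call it with."""

    factory: Callable[..., Any]
    kwargs: Dict[str, Any] = field(default_factory=dict)

    def _build_optimizer(self, params):
        # Hyperparameters come only from kwargs; params is passed by keyword.
        return self.factory(params=params, **self.kwargs)


def external_optimizer(params, lr=0.01, momentum=0.0):
    """Stand-in for a third-party optimizer constructor."""
    return {"params": list(params), "lr": lr, "momentum": momentum}
```

Because the wrapper exposes the same hook as a typed config, a dispatcher can treat both kinds uniformly.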
Bases: OptimizerConfig
Shared base for the dion-family typed configs (Muon / NorMuon / Dion2 / Dion).
Dion optimizers need Dion’s parameter grouping (built from the model) and the
device mesh rather than a flat parameter list, so :meth:build runs grouping
per model part. The grouping-only fields below (scalar_* / *_lr) are
consumed by :func:build_dion_optimizer and stripped from the constructor
kwargs. Dion is incompatible with Megatron-FSDP optimizer sharding; this is
enforced at the recipe layer (supports_megatron_fsdp_sharding = False
drives an allow=False sharding call that asserts rather than silently
returning an unsharded optimizer).
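As a rough picture of what "grouping-only fields are stripped from the constructor kwargs" means, the sketch below separates fields by name pattern and groups parameters by rank. It is illustrative only: the helper names, the example field names (`mu`, `scalar_betas`, `embed_lr`) and the rank-based rule are assumptions, and the real grouping in build_dion_optimizer may route parameters differently.

```python
def split_grouping_kwargs(config_fields):
    """Separate grouping-only fields (scalar_* / *_lr) from constructor kwargs."""
    grouping, ctor = {}, {}
    for name, value in config_fields.items():
        is_grouping = name.startswith("scalar_") or name.endswith("_lr")
        (grouping if is_grouping else ctor)[name] = value
    return grouping, ctor


def group_by_rank(named_shapes, lr, scalar_lr):
    """Put 2D+ parameters in the matrix group and 1D parameters in the scalar group."""
    matrix = [name for name, shape in named_shapes.items() if len(shape) >= 2]
    scalar = [name for name, shape in named_shapes.items() if len(shape) < 2]
    return [
        {"params": matrix, "lr": lr},
        {"params": scalar, "lr": scalar_lr},
    ]
```

Note that a plain `lr` field does not match the `*_lr` pattern in this sketch, so it stays with the constructor kwargs.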
Instantiate the concrete dion optimizer from grouped params + filtered kwargs.
Return True if factory accepts a foreach kwarg.
torch.optim optimizers take foreach; external factories such as TE
FusedAdam do not, so passing it would raise TypeError.
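One way to implement such a check is signature inspection. The sketch below is an assumption about the approach, not the module's actual code; `accepts_foreach`, `call_with_optional_foreach` and the two stub classes are hypothetical names.

```python
import inspect


def accepts_foreach(factory) -> bool:
    """Return True if factory declares a parameter named foreach."""
    try:
        params = inspect.signature(factory).parameters
    except (TypeError, ValueError):
        # Some builtins and C extensions expose no signature; assume unsupported.
        return False
    return "foreach" in params


def call_with_optional_foreach(factory, params, foreach, **kwargs):
    """Pass foreach only when it is set and the factory can take it."""
    if foreach is not None and accepts_foreach(factory):
        kwargs["foreach"] = foreach
    return factory(params, **kwargs)


class WithForeach:
    """Stub shaped like an optimizer that accepts foreach."""

    def __init__(self, params, lr=1e-3, foreach=None):
        self.foreach = foreach


class WithoutForeach:
    """Stub shaped like an external optimizer with no foreach kwarg."""

    def __init__(self, params, lr=1e-3):
        self.lr = lr
```

With this guard, `call_with_optional_foreach(WithoutForeach, [], False)` constructs the optimizer normally instead of raising TypeError.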
Return False when TP > 1 (foreach is unsupported), else None.
Import an object from a dotted path, e.g. "torch.optim.AdamW".
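A dotted-path import of this kind is commonly written with importlib. The following is a minimal sketch, not the module's own helper; the name `import_from_path` and the error handling are assumptions.

```python
import importlib


def import_from_path(path: str):
    """Resolve 'package.module.Attr' to the object it names."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path, got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

For example, `import_from_path("collections.OrderedDict")` returns the `OrderedDict` class, while a string with no dot raises ValueError.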
Build one optimizer per model.parts (or [model]).
Thin dispatcher: it normalizes config to an :class:OptimizerConfig and
returns config.build(model, ...). Per-part concerns (TP foreach,
Dion param grouping) live on the config. Megatron-FSDP optimizer sharding is
re-applied separately by the recipe layer.
config is one of:
- a typed :class:OptimizerConfig instance, the Automodel-native path.
- a (name_or_path, kwargs) tuple, where name_or_path is a short registry name (see :data:OPTIMIZER_CONFIG_REGISTRY, e.g. "adamw") or a dotted import path (e.g. "torch.optim.AdamW"), and kwargs are the constructor arguments. A registry name or import path that resolves to an :class:OptimizerConfig subclass is built from its typed fields; any other callable is wrapped in an :class:OptimizerFromFactoryConfig (the escape hatch for external integrations, e.g. veRL).
Parameters:
Model (or model with .parts) to optimize.
An :class:OptimizerConfig instance or a (name_or_path, kwargs) tuple.
Device mesh used for tensor/data parallelism.
Returns: list[torch.optim.Optimizer]
One optimizer per model part.
Normalize an optimizer target plus kwargs into an :class:OptimizerConfig.
This is the single normalization entry point shared by the recipe layer
(which resolves a YAML _target_ to a Python object) and
:func:build_optimizer (which accepts (name_or_path, kwargs) tuples).
target is one of:
- an :class:OptimizerConfig instance: returned as-is (kwargs ignored, since the instance already carries its typed fields).
- an :class:OptimizerConfig subclass: instantiated from its typed fields with **kwargs.
- a string: a registry short name (see :data:OPTIMIZER_CONFIG_REGISTRY, e.g. "adamw") or a dotted import path (e.g. "torch.optim.AdamW"); it is resolved and then handled as a subclass or callable.
- any other optimizer callable/class: wrapped in an :class:OptimizerFromFactoryConfig (the escape hatch for external integrations, e.g. veRL).
Parameters:
The optimizer config instance/subclass, registry name or import path, or optimizer callable to normalize.
Constructor arguments for the resolved config/callable.
Returns: OptimizerConfig
An :class:OptimizerConfig ready to build(...).
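The four accepted target shapes can be sketched as one dispatch function. This is a simplified illustration under stated assumptions: `BaseSketch`, `AdamWSketch`, `FactorySketch`, `REGISTRY_SKETCH` and `normalize` are hypothetical stand-ins for the real classes, registry and function, and error handling is omitted.

```python
import importlib
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class BaseSketch:
    """Hypothetical stand-in for the base optimizer config."""

    lr: float = 1e-3


@dataclass
class AdamWSketch(BaseSketch):
    """Hypothetical typed config."""

    weight_decay: float = 0.01


@dataclass
class FactorySketch(BaseSketch):
    """Hypothetical wrapper for an arbitrary optimizer callable."""

    factory: Optional[Callable[..., Any]] = None
    kwargs: Dict[str, Any] = field(default_factory=dict)


# Hypothetical registry of short names.
REGISTRY_SKETCH = {"adamw": AdamWSketch}


def normalize(target, kwargs=None):
    """Turn an instance, subclass, string, or callable into a config instance."""
    kwargs = dict(kwargs or {})
    if isinstance(target, BaseSketch):
        return target  # already a config: kwargs are ignored
    if isinstance(target, str):
        if target in REGISTRY_SKETCH:
            target = REGISTRY_SKETCH[target]
        else:
            module_name, _, attr = target.rpartition(".")
            target = getattr(importlib.import_module(module_name), attr)
    if isinstance(target, type) and issubclass(target, BaseSketch):
        return target(**kwargs)  # typed config built from its fields
    return FactorySketch(factory=target, kwargs=kwargs)  # escape hatch
```

Resolving strings first means a registry name and an import path both fall through to the same subclass-or-callable branch, so there is a single place where the typed and factory paths diverge.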