`nemo_automodel.components.models.hy_v3.state_dict_adapter`#

State dict conversion between the on-disk tencent/Hy3-preview HF checkpoint and Automodel’s native (grouped-experts) format.

On-disk HF format (what tencent/Hy3-preview safetensors actually contain):

    model.layers.{L}.mlp.expert_bias                              # [n_experts]
    model.layers.{L}.mlp.router.gate.weight                       # [n_experts, hidden]
    model.layers.{L}.mlp.experts.{E}.gate_proj.weight             # [moe_inter, hidden]
    model.layers.{L}.mlp.experts.{E}.up_proj.weight               # [moe_inter, hidden]
    model.layers.{L}.mlp.experts.{E}.down_proj.weight             # [hidden, moe_inter]
    model.layers.{L}.mlp.shared_mlp.{gate,up,down}_proj.weight    # [moe_inter, hidden] / [hidden, moe_inter]

Automodel native format (matches the rest of the MoE stack):

    model.layers.{L}.mlp.gate.e_score_correction_bias                 # [n_local] (on Gate, not MoE)
    model.layers.{L}.mlp.gate.weight                                  # [n_experts, hidden]
    model.layers.{L}.mlp.experts.gate_and_up_projs                    # [n_local, hidden, 2*moe_inter]
    model.layers.{L}.mlp.experts.down_projs                           # [n_local, moe_inter, hidden]
    model.layers.{L}.mlp.shared_experts.{gate,up,down}_proj.weight    # unchanged shapes
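To make the two layouts concrete, the sketch below lists the expected key-to-shape mapping for the routed part of one MoE layer in each format. The helper names (`on_disk_shapes`, `native_shapes`, `expert_numel`) are hypothetical and not part of the module; shared-expert keys are left out because their shapes do not change.

```python
import math


def on_disk_shapes(n_experts, hidden, moe_inter, layer=0):
    """Key -> shape for one MoE layer in the on-disk HF layout (illustrative)."""
    p = f"model.layers.{layer}.mlp"
    shapes = {
        f"{p}.expert_bias": (n_experts,),
        f"{p}.router.gate.weight": (n_experts, hidden),
    }
    for e in range(n_experts):
        shapes[f"{p}.experts.{e}.gate_proj.weight"] = (moe_inter, hidden)
        shapes[f"{p}.experts.{e}.up_proj.weight"] = (moe_inter, hidden)
        shapes[f"{p}.experts.{e}.down_proj.weight"] = (hidden, moe_inter)
    return shapes


def native_shapes(n_experts, n_local, hidden, moe_inter, layer=0):
    """Key -> shape for the same layer in the native grouped layout (illustrative)."""
    p = f"model.layers.{layer}.mlp"
    return {
        f"{p}.gate.e_score_correction_bias": (n_local,),
        f"{p}.gate.weight": (n_experts, hidden),
        f"{p}.experts.gate_and_up_projs": (n_local, hidden, 2 * moe_inter),
        f"{p}.experts.down_projs": (n_local, moe_inter, hidden),
    }


def expert_numel(shapes):
    """Total number of elements stored under the routed-expert keys."""
    return sum(math.prod(s) for k, s in shapes.items() if ".mlp.experts." in k)
```

With a single rank holding every expert (`n_local == n_experts`), the routed-expert element count is the same in both layouts; only the grouping and the axis order differ.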

Differences (vs. every other Automodel MoE adapter):

Per-expert split tensors -> grouped (handled by MoESplitExpertsStateDictMixin).
Three HYV3-specific name renames: expert_bias <-> gate.e_score_correction_bias, router.gate.weight <-> gate.weight, shared_mlp.* <-> shared_experts.*.
MTP layers (indices >= num_hidden_layers) on disk must be filtered out on load.
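The three renames can be pictured as two small tables of (compiled pattern, replacement) pairs, which is the shape the type annotations of `_NATIVE_TO_HF_RENAMES` and `_HF_TO_NATIVE_RENAMES` suggest. The patterns below are assumptions written for illustration, not the module's actual regexes, and `rename` is a hypothetical helper.

```python
import re

# Hypothetical stand-ins for the module's rename tables.
HF_TO_NATIVE = (
    (re.compile(r"\.mlp\.expert_bias$"), ".mlp.gate.e_score_correction_bias"),
    (re.compile(r"\.mlp\.router\.gate\.weight$"), ".mlp.gate.weight"),
    (re.compile(r"\.mlp\.shared_mlp\."), ".mlp.shared_experts."),
)
NATIVE_TO_HF = (
    (re.compile(r"\.mlp\.gate\.e_score_correction_bias$"), ".mlp.expert_bias"),
    (re.compile(r"\.mlp\.gate\.weight$"), ".mlp.router.gate.weight"),
    (re.compile(r"\.mlp\.shared_experts\."), ".mlp.shared_mlp."),
)


def rename(key, table):
    """Apply the first matching rename; keys that match nothing pass through."""
    for pattern, replacement in table:
        new_key, count = pattern.subn(replacement, key)
        if count:
            return new_key
    return key


# Round trip on an on-disk key.
disk_key = "model.layers.3.mlp.router.gate.weight"
native_key = rename(disk_key, HF_TO_NATIVE)
assert native_key == "model.layers.3.mlp.gate.weight"
assert rename(native_key, NATIVE_TO_HF) == disk_key
```

Anchoring each pattern on `.mlp.` keeps per-expert keys such as `experts.{E}.gate_proj.weight` untouched, and makes a rename a no-op when applied to a key that is already in the target naming.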

Why the renames live in the adapter rather than in the storage reader’s key_mapping: nemo_automodel/components/checkpoint/checkpointing.py:507 deliberately passes reader_key_mapping=None when a model has a state_dict_adapter (to avoid double-translation). So the adapter’s to_hf / from_hf must produce keys that match the actual on-disk strings.

Module Contents#

Classes#

HYV3StateDictAdapter

Bridges Automodel native (grouped experts) and tencent/Hy3-preview on-disk HF.

Data#

`logger`
`_NATIVE_TO_HF_RENAMES`
`_HF_TO_NATIVE_RENAMES`

API#

nemo_automodel.components.models.hy_v3.state_dict_adapter.logger#: ‘getLogger(…)’

nemo_automodel.components.models.hy_v3.state_dict_adapter._NATIVE_TO_HF_RENAMES: tuple[tuple[re.Pattern[str], str], ...]#: ((), (), ())

nemo_automodel.components.models.hy_v3.state_dict_adapter._HF_TO_NATIVE_RENAMES: tuple[tuple[re.Pattern[str], str], ...]#: ((), (), ())

class nemo_automodel.components.models.hy_v3.state_dict_adapter.HYV3StateDictAdapter( config: Any, moe_config: nemo_automodel.components.moe.config.MoEConfig, backend: nemo_automodel.components.models.common.BackendConfig, dtype: torch.dtype = torch.bfloat16, )#

Bases: nemo_automodel.components.moe.state_dict_mixin.MoESplitExpertsStateDictMixin, nemo_automodel.components.checkpoint.state_dict_adapter.StateDictAdapter

Bridges Automodel native (grouped experts) and tencent/Hy3-preview on-disk HF.

Inherits the per-expert split/merge logic from MoESplitExpertsStateDictMixin; only the three HYV3-specific name renames and the MTP-layer filtering live here.

to_hf(state_dict: dict[str, Any], exclude_key_regex: Optional[str] = None, **kwargs) → dict[str, Any]#

Convert native state dict back to the on-disk Tencent format.

Steps:

Split grouped expert tensors into per-expert HF keys (mixin).
Apply HYV3 name renames (gate.e_score_correction_bias -> expert_bias, gate.weight -> router.gate.weight, shared_experts. -> shared_mlp.).

from_hf(hf_state_dict: dict[str, Any], device_mesh: Optional[torch.distributed.device_mesh.DeviceMesh] = None, **kwargs) → dict[str, Any]#

Convert the on-disk Tencent state dict to native format.

Steps:

Drop MTP (multi-token prediction) layer keys.
Apply HYV3 name renames (on-disk names -> native names, e.g. router.gate.weight -> gate.weight).
Merge per-expert split tensors into grouped form via the mixin (validates expert availability against the rank’s EP slice).
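The merge step can be sketched in plain Python, with nested lists standing in for tensors. This is an illustration under stated assumptions, not the mixin's implementation: `merge_local_experts` and `transpose` are hypothetical names, the rank's EP slice is passed in as an explicit list of expert ids, and the gate-then-up concatenation order is assumed from the name `gate_and_up_projs`.

```python
def transpose(matrix):
    """Transpose a 2-D nested list."""
    return [list(col) for col in zip(*matrix)]


def merge_local_experts(hf_sd, prefix, local_experts):
    """Group per-expert weights for the experts owned by this rank.

    Assumed layout: gate and up are transposed to [hidden, moe_inter] and
    concatenated along the last dimension, gate first.
    """
    projections = ("gate_proj", "up_proj", "down_proj")
    # Validate that every expert in the rank's slice is present before merging.
    missing = [
        f"{prefix}.experts.{e}.{proj}.weight"
        for e in local_experts
        for proj in projections
        if f"{prefix}.experts.{e}.{proj}.weight" not in hf_sd
    ]
    if missing:
        raise KeyError(f"missing expert weights for this rank: {missing}")

    gate_and_up, down = [], []
    for e in local_experts:
        gate_t = transpose(hf_sd[f"{prefix}.experts.{e}.gate_proj.weight"])
        up_t = transpose(hf_sd[f"{prefix}.experts.{e}.up_proj.weight"])
        # Row-wise concatenation gives [hidden, 2*moe_inter] per expert.
        gate_and_up.append([g + u for g, u in zip(gate_t, up_t)])
        # [hidden, moe_inter] on disk becomes [moe_inter, hidden] when grouped.
        down.append(transpose(hf_sd[f"{prefix}.experts.{e}.down_proj.weight"]))
    return {
        f"{prefix}.experts.gate_and_up_projs": gate_and_up,
        f"{prefix}.experts.down_projs": down,
    }
```

Checking availability up front means a rank whose slice is not fully covered by the loaded shards fails with a list of the missing keys rather than producing a partially filled grouped tensor.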

convert_single_tensor_to_hf(fqn: str, tensor: Any, **kwargs) → list[tuple[str, Any]]#

Per-tensor variant of to_hf (used by save paths that stream tensors).

Mirrors to_hf but operates on one (fqn, tensor) pair at a time:

Try the mixin’s per-expert split, which returns multiple (key, tensor) pairs when fqn names a grouped expert tensor and None otherwise.
Apply HYV3 name renames to whichever key set we end up with.
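The control flow of the two steps above can be sketched as follows. All three functions are hypothetical stand-ins: `split_grouped` plays the role of the mixin's per-expert split, `rename_to_disk` covers only one of the three renames, and plain Python values stand in for tensors.

```python
def split_grouped(fqn, tensor):
    """Toy stand-in for the mixin's split: per-expert pairs, or None."""
    suffix = ".experts.down_projs"
    if not fqn.endswith(suffix):
        return None
    prefix = fqn[: -len(suffix)]
    # Toy slicing only: a real split would also restore the on-disk
    # [hidden, moe_inter] layout and account for the rank's expert offset.
    return [
        (f"{prefix}.experts.{e}.down_proj.weight", expert)
        for e, expert in enumerate(tensor)
    ]


def rename_to_disk(key):
    """Toy rename covering only gate.weight -> router.gate.weight."""
    tail = ".mlp.gate.weight"
    if key.endswith(tail):
        return key[: -len(tail)] + ".mlp.router.gate.weight"
    return key


def convert_single(fqn, tensor):
    """Split if the key is a grouped expert tensor, then rename every key."""
    pairs = split_grouped(fqn, tensor)
    if pairs is None:
        # Not an expert tensor: carry the single pair through unchanged.
        pairs = [(fqn, tensor)]
    return [(rename_to_disk(key), value) for key, value in pairs]
```

Falling back to a one-element list keeps the return type uniform, so a streaming save path can iterate over the result without special-casing non-expert tensors.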

_is_mtp_key(key: str) → bool#: Return True if key belongs to an MTP layer (index >= num_hidden_layers).
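A minimal sketch of such a check, assuming MTP weights are stored under the same `model.layers.{index}.` prefix as regular layers. The regex and the standalone function signature are illustrative; the real method reads num_hidden_layers from the adapter's config.

```python
import re

# Assumed key prefix; captures the layer index.
_LAYER_INDEX = re.compile(r"^model\.layers\.(\d+)\.")


def is_mtp_key(key, num_hidden_layers):
    """Return True if key names a layer whose index is >= num_hidden_layers."""
    match = _LAYER_INDEX.match(key)
    return match is not None and int(match.group(1)) >= num_hidden_layers


# Example: dropping MTP keys from a toy state dict with 4 regular layers.
toy = {
    "model.embed_tokens.weight": 0,
    "model.layers.3.mlp.expert_bias": 1,
    "model.layers.4.mlp.expert_bias": 2,
}
kept = {k: v for k, v in toy.items() if not is_mtp_key(k, num_hidden_layers=4)}
assert sorted(kept) == ["model.embed_tokens.weight", "model.layers.3.mlp.expert_bias"]
```

Keys without a layer index, such as embeddings or the final norm, never match and are always kept.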
