nemo_automodel.components.models.qwen3_moe.state_dict_adapter#

Module Contents#

Classes#

Qwen3MoeStateDictAdapter

Converts between HF Qwen3-MoE checkpoints and our grouped-experts native format.

Data#

API#

nemo_automodel.components.models.qwen3_moe.state_dict_adapter.logger#

'getLogger(...)'

nemo_automodel.components.models.qwen3_moe.state_dict_adapter._LORA_EXPERT_SUFFIXES#

('lora_gate_and_up_A', 'lora_gate_and_up_B', 'lora_down_A', 'lora_down_B')

class nemo_automodel.components.models.qwen3_moe.state_dict_adapter.Qwen3MoeStateDictAdapter(
config: Any,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
dtype: torch.dtype = torch.float32,
)#

Bases: nemo_automodel.components.moe.state_dict_mixin.MoESplitExpertsStateDictMixin, nemo_automodel.components.checkpoint.state_dict_adapter.StateDictAdapter

Converts between HF Qwen3-MoE checkpoints and our grouped-experts native format.

Qwen3-MoE HF experts use keys:

  • model.layers.{L}.mlp.experts.{E}.gate_proj.weight

  • model.layers.{L}.mlp.experts.{E}.up_proj.weight

  • model.layers.{L}.mlp.experts.{E}.down_proj.weight

Our native format groups them into:

  • model.layers.{L}.mlp.experts.gate_and_up_projs # [n_experts, dim, 2*moe_inter_dim]

  • model.layers.{L}.mlp.experts.down_projs # [n_experts, moe_inter_dim, dim]
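For illustration, a minimal sketch of this grouping (not the adapter's own implementation), assuming the standard HF nn.Linear weight layout [out_features, in_features] and a gate-then-up concatenation order:

```python
import torch

def group_experts(hf_sd: dict[str, torch.Tensor], layer: int, n_experts: int) -> dict[str, torch.Tensor]:
    """Stack per-expert HF weights into the grouped native layout (illustrative only)."""
    prefix = f"model.layers.{layer}.mlp.experts"
    gate_up, down = [], []
    for e in range(n_experts):
        gate = hf_sd[f"{prefix}.{e}.gate_proj.weight"]  # [moe_inter_dim, dim]
        up = hf_sd[f"{prefix}.{e}.up_proj.weight"]      # [moe_inter_dim, dim]
        # Concatenate gate and up along the output dim, then transpose to [dim, 2*moe_inter_dim].
        gate_up.append(torch.cat([gate, up], dim=0).t())
        # Transpose down_proj from [dim, moe_inter_dim] to [moe_inter_dim, dim].
        down.append(hf_sd[f"{prefix}.{e}.down_proj.weight"].t())
    return {
        f"{prefix}.gate_and_up_projs": torch.stack(gate_up),  # [n_experts, dim, 2*moe_inter_dim]
        f"{prefix}.down_projs": torch.stack(down),            # [n_experts, moe_inter_dim, dim]
    }
```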

Initialization

to_hf(
state_dict: dict[str, Any],
exclude_key_regex: Optional[str] = None,
quantization: bool = False,
**kwargs,
) dict[str, Any]#

convert_single_tensor_to_hf(
fqn: str,
tensor: Any,
**kwargs,
) list[tuple[str, Any]]#

Convert a single tensor from native format to HuggingFace format.

When v4_compatible=False (the default), LoRA expert tensors are emitted in PEFT v0.18+ ParamWrapper format so that PeftModel.from_pretrained() can load them directly. When v4_compatible=True, the legacy per-expert split is used instead (via the parent mixin).

Parameters:
  • fqn – Fully qualified name of the tensor in native format

  • tensor – The tensor to convert

  • **kwargs – Additional arguments for conversion

Returns:

List of (fqn, tensor) tuples in HuggingFace format

_convert_lora_to_paramwrapper(
fqn: str,
tensor: torch.Tensor,
) list[tuple[str, torch.Tensor]]#

Convert a single grouped MoE LoRA tensor to PEFT ParamWrapper format.

ParamWrapper format stores fused 3-D expert LoRA parameters as 2-D tensors with the expert dimension folded into the rank dimension.

Shape mapping (automodel native -> ParamWrapper):

down_proj (outer wrapper, NO base_layer. prefix; processed first alphabetically):

  • lora_down_B (E, r, H) -> lora_A.weight (r*E, H) reshape

  • lora_down_A (E, I, r) -> lora_B.weight (I, r*E) permute+reshape

gate_up_proj (inner wrapper, HAS base_layer. prefix):

  • lora_gate_and_up_B (E, r, 2*I) -> base_layer.lora_A.weight (r*E, 2*I) reshape

  • lora_gate_and_up_A (E, H, r) -> base_layer.lora_B.weight (H, r*E) permute+reshape

Returns:

List containing one (fqn, tensor) tuple in ParamWrapper format.
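A shape-level sketch of the down_proj folding described above (illustrative only; the exact expert/rank interleaving inside the folded dimension is an assumption, only the 3-D to 2-D shapes follow the docstring):

```python
import torch

# Hypothetical sizes: E = experts, r = LoRA rank, H = hidden dim, I = moe_inter_dim.
E, r, H, I = 8, 16, 2048, 768

lora_down_B = torch.randn(E, r, H)  # native grouped LoRA tensor
lora_down_A = torch.randn(E, I, r)  # native grouped LoRA tensor

# (E, r, H) -> (r*E, H): fold the expert dim into the rank dim with a plain reshape.
down_lora_A_weight = lora_down_B.reshape(E * r, H)

# (E, I, r) -> (I, r*E): move I to the front, then fold the expert dim into the rank dim.
down_lora_B_weight = lora_down_A.permute(1, 0, 2).reshape(I, E * r)

assert down_lora_A_weight.shape == (E * r, H)
assert down_lora_B_weight.shape == (I, E * r)
```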

from_hf(
hf_state_dict: dict[str, Any],
device_mesh: Optional[torch.distributed.device_mesh.DeviceMesh] = None,
**kwargs,
) dict[str, Any]#

Convert HF checkpoint to native format, handling ParamWrapper LoRA keys.

Before delegating to the parent _from_hf_w_merged_experts (which handles the legacy per-expert LoRA format), this method scans for ParamWrapper-format LoRA keys and converts them back to the native grouped format expected by GroupedExpertsLoRA.
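A hypothetical sketch of that scan (the helper name and match logic are assumptions; the actual ParamWrapper key forms are listed under _convert_paramwrapper_to_native below):

```python
from typing import Any

def split_paramwrapper_lora_keys(
    hf_state_dict: dict[str, Any],
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Separate ParamWrapper-style expert LoRA keys from the remaining HF keys."""
    lora, rest = {}, {}
    for key, value in hf_state_dict.items():
        if ".mlp.experts." in key and (".lora_A.weight" in key or ".lora_B.weight" in key):
            lora[key] = value
        else:
            rest[key] = value
    return lora, rest
```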

_convert_paramwrapper_to_native(
state_dict: dict[str, Any],
) dict[str, Any]#

Convert PEFT ParamWrapper LoRA keys to native grouped MoE LoRA format.

This is the reverse of _convert_lora_to_paramwrapper. It detects ParamWrapper-format keys and converts them back to the 3-D grouped tensors expected by GroupedExpertsLoRA.

Reverse transforms (down_proj is outer, gate_up_proj is inner):

  • experts.lora_A.weight (r*E, H) -> (E, r, H) = lora_down_B

  • experts.lora_B.weight (I, r*E) -> (E, I, r) = lora_down_A

  • experts.base_layer.lora_A.weight (r*E, 2*I) -> (E, r, 2*I) = lora_gate_and_up_B

  • experts.base_layer.lora_B.weight (H, r*E) -> (E, H, r) = lora_gate_and_up_A
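A shape-level sketch of the reverse folding for the down_proj pair, mirroring the forward sketch above (the interleaving is again an assumption; only the shapes follow the docstring):

```python
import torch

# Hypothetical sizes: E = experts, r = LoRA rank, H = hidden dim, I = moe_inter_dim.
E, r, H, I = 8, 16, 2048, 768

lora_A_weight = torch.randn(r * E, H)  # ParamWrapper experts.lora_A.weight
lora_B_weight = torch.randn(I, r * E)  # ParamWrapper experts.lora_B.weight

# (r*E, H) -> (E, r, H): un-fold the expert dim out of the rank dim.
lora_down_B = lora_A_weight.reshape(E, r, H)

# (I, r*E) -> (E, I, r): un-fold, then move the expert dim back to the front.
lora_down_A = lora_B_weight.reshape(I, E, r).permute(1, 0, 2)

assert lora_down_B.shape == (E, r, H)
assert lora_down_A.shape == (E, I, r)
```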