nemo_automodel.components.models.qwen3_moe.state_dict_adapter

Module Contents

Classes

Name	Description
`Qwen3MoeStateDictAdapter`	Converts between HF Qwen3-MoE checkpoints and our grouped-experts native format.

Data

_LORA_EXPERT_SUFFIXES

logger

API

class nemo_automodel.components.models.qwen3_moe.state_dict_adapter.Qwen3MoeStateDictAdapter(
    config: typing.Any,
    moe_config: nemo_automodel.components.moe.config.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    dtype: torch.dtype = torch.float32
)

Bases: MoESplitExpertsStateDictMixin, StateDictAdapter

Converts between HF Qwen3-MoE checkpoints and our grouped-experts native format.

nemo_automodel.components.models.qwen3_moe.state_dict_adapter.Qwen3MoeStateDictAdapter._convert_lora_to_paramwrapper(
    fqn: str,
    tensor: torch.Tensor
) -> list[tuple[str, torch.Tensor]]

Convert a single grouped MoE LoRA tensor to PEFT ParamWrapper format.

ParamWrapper format stores fused 3-D expert LoRA parameters as 2-D tensors with the expert dimension folded into the rank dimension.

Shape mapping (automodel native -> ParamWrapper):

down_proj (outer wrapper, NO base_layer prefix — processed first alphabetically):

lora_down_B (E, r, H) -> lora_A.weight (r*E, H) reshape
lora_down_A (E, I, r) -> lora_B.weight (I, r*E) permute+reshape

gate_up_proj (inner wrapper, HAS base_layer. prefix):

lora_gate_and_up_B (E, r, 2*I) -> base_layer.lora_A.weight (r*E, 2*I) reshape
lora_gate_and_up_A (E, H, r) -> base_layer.lora_B.weight (H, r*E) permute+reshape

Returns: list[tuple[str, torch.Tensor]]

List containing one (fqn, tensor) tuple in ParamWrapper format.
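The shape mapping above can be checked with plain arithmetic. The following is a minimal sketch, not the adapter's implementation: `paramwrapper_shapes` is a hypothetical helper, and the sizes used (E experts, rank r, hidden size H, intermediate size I) are illustrative values only.

```python
import math


def paramwrapper_shapes(E: int, r: int, H: int, I: int) -> dict:
    """Map each native LoRA suffix to (ParamWrapper key, native shape, ParamWrapper shape).

    Hypothetical helper that mirrors the documented shape table; it does not
    touch any tensors.
    """
    return {
        # down_proj: outer wrapper, no base_layer prefix
        "lora_down_B": ("lora_A.weight", (E, r, H), (r * E, H)),
        "lora_down_A": ("lora_B.weight", (E, I, r), (I, r * E)),
        # gate_up_proj: inner wrapper, base_layer. prefix
        "lora_gate_and_up_B": ("base_layer.lora_A.weight", (E, r, 2 * I), (r * E, 2 * I)),
        "lora_gate_and_up_A": ("base_layer.lora_B.weight", (E, H, r), (H, r * E)),
    }


# Illustrative sizes (assumed, not taken from any real Qwen3-MoE config).
shapes = paramwrapper_shapes(E=8, r=4, H=16, I=32)

for suffix, (key, native, wrapped) in shapes.items():
    # Folding the expert dimension into the rank dimension never changes
    # the number of elements, only the layout.
    assert math.prod(native) == math.prod(wrapped)
    print(f"{suffix} {native} -> {key} {wrapped}")
```

A table like this is convenient for sanity-checking the tensor shapes in an exported adapter checkpoint before loading it.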

nemo_automodel.components.models.qwen3_moe.state_dict_adapter.Qwen3MoeStateDictAdapter._convert_paramwrapper_to_native(
    state_dict: dict[str, typing.Any]
) -> dict[str, typing.Any]

Convert PEFT ParamWrapper LoRA keys to native grouped MoE LoRA format.

This is the reverse of _convert_lora_to_paramwrapper. It detects ParamWrapper-format keys and converts them back to the 3-D grouped tensors expected by GroupedExpertsLoRA.

Reverse transforms (down_proj is outer, gate_up_proj is inner):

experts.lora_A.weight (r*E, H) -> (E, r, H) = lora_down_B
experts.lora_B.weight (I, r*E) -> (E, I, r) = lora_down_A
experts.base_layer.lora_A.weight (r*E, 2*I) -> (E, r, 2*I) = lora_gate_and_up_B
experts.base_layer.lora_B.weight (H, r*E) -> (E, H, r) = lora_gate_and_up_A
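The down_proj transforms and their inverses can be sketched in PyTorch as a round trip. This is an illustration under stated assumptions, not the adapter's code: the function names are hypothetical, and the permute order is assumed so that the expert index is the major index within the folded `r*E` dimension (the documentation only states "reshape" and "permute+reshape").

```python
import torch

# Illustrative sizes (assumed): experts, LoRA rank, hidden size, intermediate size.
E, r, H, I = 4, 2, 8, 16


def fold_down_b(t: torch.Tensor) -> torch.Tensor:
    """(E, r, H) -> (r*E, H): a plain reshape stacks each expert's r rows."""
    e, rank, h = t.shape
    return t.reshape(e * rank, h)


def unfold_down_b(w: torch.Tensor, num_experts: int) -> torch.Tensor:
    """(r*E, H) -> (E, r, H): inverse of fold_down_b."""
    return w.reshape(num_experts, -1, w.shape[-1])


def fold_down_a(t: torch.Tensor) -> torch.Tensor:
    """(E, I, r) -> (I, r*E): move I to the front, then merge (E, r) into columns."""
    e, i, rank = t.shape
    return t.permute(1, 0, 2).reshape(i, e * rank)


def unfold_down_a(w: torch.Tensor, num_experts: int) -> torch.Tensor:
    """(I, r*E) -> (E, I, r): split the columns back into (E, r), then restore the order."""
    i = w.shape[0]
    return w.reshape(i, num_experts, -1).permute(1, 0, 2).contiguous()


lora_down_B = torch.randn(E, r, H)
lora_down_A = torch.randn(E, I, r)

lora_A_weight = fold_down_b(lora_down_B)  # ParamWrapper-style 2-D tensor
lora_B_weight = fold_down_a(lora_down_A)  # ParamWrapper-style 2-D tensor

restored_B = unfold_down_b(lora_A_weight, E)
restored_A = unfold_down_a(lora_B_weight, E)
```

Under the same assumption, the gate_up_proj pair follows the identical pattern with `2*I` in place of `H` for the reshape case and `H` in place of `I` for the permute+reshape case.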

nemo_automodel.components.models.qwen3_moe.state_dict_adapter.Qwen3MoeStateDictAdapter.convert_single_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    **kwargs
) -> list[tuple[str, typing.Any]]

Convert a single tensor from native format to HuggingFace format.

When v4_compatible=False (the default), LoRA expert tensors are emitted in PEFT v0.18+ ParamWrapper format so that PeftModel.from_pretrained() can load them directly. When v4_compatible=True, the legacy per-expert split is used instead (via the parent mixin).

Parameters:

fqn

str

Fully qualified name of the tensor in native format

tensor

Any

The tensor to convert

**kwargs

Defaults to {}

Additional arguments for conversion

Returns: list[tuple[str, Any]]

List of (fqn, tensor) tuples in HuggingFace format

nemo_automodel.components.models.qwen3_moe.state_dict_adapter.Qwen3MoeStateDictAdapter.from_hf(
    hf_state_dict: dict[str, typing.Any],
    device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None,
    **kwargs
) -> dict[str, typing.Any]

Convert HF checkpoint to native format, handling ParamWrapper LoRA keys.

Before delegating to the parent _from_hf_w_merged_experts (which handles legacy per-expert LoRA format), this method scans for ParamWrapper-format LoRA keys and converts them back to the native grouped format expected by GroupedExpertsLoRA.

nemo_automodel.components.models.qwen3_moe.state_dict_adapter.Qwen3MoeStateDictAdapter.to_hf(
    state_dict: dict[str, typing.Any],
    exclude_key_regex: typing.Optional[str] = None,
    quantization: bool = False,
    **kwargs
) -> dict[str, typing.Any]

nemo_automodel.components.models.qwen3_moe.state_dict_adapter._LORA_EXPERT_SUFFIXES = ('lora_gate_and_up_A', 'lora_gate_and_up_B', 'lora_down_A', 'lora_down_B')
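One plausible use of this tuple is to recognize grouped expert LoRA tensors by the tail of their fully qualified name. The sketch below assumes that usage; `is_grouped_expert_lora` and the example key names are hypothetical and are not taken from the module.

```python
# Copied from the module-level constant documented above.
_LORA_EXPERT_SUFFIXES = (
    "lora_gate_and_up_A",
    "lora_gate_and_up_B",
    "lora_down_A",
    "lora_down_B",
)


def is_grouped_expert_lora(fqn: str) -> bool:
    """Return True if the key ends with one of the grouped expert LoRA suffixes.

    Hypothetical helper: str.endswith accepts a tuple, so a single call
    covers all four suffixes.
    """
    return fqn.endswith(_LORA_EXPERT_SUFFIXES)


# Example keys are made up for illustration only.
expert_key = "model.layers.0.mlp.experts.lora_down_A"
other_key = "model.layers.0.self_attn.q_proj.lora_A.weight"

print(is_grouped_expert_lora(expert_key))
print(is_grouped_expert_lora(other_key))
```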

nemo_automodel.components.models.qwen3_moe.state_dict_adapter.logger = logging.getLogger(__name__)