nemo_automodel.components.models.qwen3_moe.state_dict_adapter#
Module Contents#
Classes#
Qwen3MoeStateDictAdapter: Converts between HF Qwen3-MoE checkpoints and our grouped-experts native format.
Data#
API#
- nemo_automodel.components.models.qwen3_moe.state_dict_adapter.logger#
'getLogger(…)'
- nemo_automodel.components.models.qwen3_moe.state_dict_adapter._LORA_EXPERT_SUFFIXES#
('lora_gate_and_up_A', 'lora_gate_and_up_B', 'lora_down_A', 'lora_down_B')
- class nemo_automodel.components.models.qwen3_moe.state_dict_adapter.Qwen3MoeStateDictAdapter(
- config: Any,
- moe_config: nemo_automodel.components.moe.config.MoEConfig,
- backend: nemo_automodel.components.models.common.BackendConfig,
- dtype: torch.dtype = torch.float32,
- )
Bases:
nemo_automodel.components.moe.state_dict_mixin.MoESplitExpertsStateDictMixin, nemo_automodel.components.checkpoint.state_dict_adapter.StateDictAdapter

Converts between HF Qwen3-MoE checkpoints and our grouped-experts native format.
Qwen3-MoE HF experts use keys:

model.layers.{L}.mlp.experts.{E}.gate_proj.weight
model.layers.{L}.mlp.experts.{E}.up_proj.weight
model.layers.{L}.mlp.experts.{E}.down_proj.weight
Our native format groups them into:

model.layers.{L}.mlp.experts.gate_and_up_projs  # [n_experts, dim, 2*moe_inter_dim]
model.layers.{L}.mlp.experts.down_projs         # [n_experts, moe_inter_dim, dim]
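The regrouping can be sketched with plain torch ops. Shapes follow the docstring above; the concatenation order of gate before up along the fused axis, and the transpose out of HF's `[out, in]` Linear layout, are illustrative assumptions, not the adapter's exact code.

```python
import torch

n_experts, dim, moe_inter_dim = 4, 8, 16

# Per-expert HF weights in nn.Linear's [out_features, in_features] layout:
# gate_proj / up_proj are [moe_inter_dim, dim], down_proj is [dim, moe_inter_dim].
gate = [torch.randn(moe_inter_dim, dim) for _ in range(n_experts)]
up = [torch.randn(moe_inter_dim, dim) for _ in range(n_experts)]
down = [torch.randn(dim, moe_inter_dim) for _ in range(n_experts)]

# Native grouped layout: fuse gate and up along the output axis (order assumed),
# transpose to [dim, 2*moe_inter_dim], then stack across experts.
gate_and_up_projs = torch.stack(
    [torch.cat([g, u], dim=0).t() for g, u in zip(gate, up)]
)  # [n_experts, dim, 2*moe_inter_dim]
down_projs = torch.stack([d.t() for d in down])  # [n_experts, moe_inter_dim, dim]
```

Going back to HF format is the mirror image: unstack along the expert axis, transpose, and split the fused projection in half.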
Initialization
- to_hf(
- state_dict: dict[str, Any],
- exclude_key_regex: Optional[str] = None,
- quantization: bool = False,
- **kwargs,
- )
- convert_single_tensor_to_hf(
- fqn: str,
- tensor: Any,
- **kwargs,
- )
Convert a single tensor from native format to HuggingFace format.
When v4_compatible=False (the default), LoRA expert tensors are emitted in PEFT v0.18+ ParamWrapper format so that PeftModel.from_pretrained() can load them directly. When v4_compatible=True, the legacy per-expert split is used instead (via the parent mixin).
- Parameters:
fqn – Fully qualified name of the tensor in native format
tensor – The tensor to convert
**kwargs – Additional arguments for conversion
- Returns:
List of (fqn, tensor) tuples in HuggingFace format
- _convert_lora_to_paramwrapper(
- fqn: str,
- tensor: torch.Tensor,
- )
Convert a single grouped MoE LoRA tensor to PEFT ParamWrapper format.
ParamWrapper format stores fused 3-D expert LoRA parameters as 2-D tensors with the expert dimension folded into the rank dimension.
Shape mapping (automodel native -> ParamWrapper):

down_proj (outer wrapper, no base_layer prefix; processed first alphabetically):
lora_down_B (E, r, H) -> lora_A.weight (r*E, H), reshape
lora_down_A (E, I, r) -> lora_B.weight (I, r*E), permute+reshape

gate_up_proj (inner wrapper, has base_layer. prefix):
lora_gate_and_up_B (E, r, 2*I) -> base_layer.lora_A.weight (r*E, 2*I), reshape
lora_gate_and_up_A (E, H, r) -> base_layer.lora_B.weight (H, r*E), permute+reshape
- Returns:
List containing one (fqn, tensor) tuple in ParamWrapper format.
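A minimal sketch of the down_proj folds listed above. The fold order when collapsing the expert axis into the rank axis (expert-major here) is an assumption for illustration; only the input and output shapes are taken from the docstring.

```python
import torch

E, r, H, I = 4, 2, 8, 16   # experts, LoRA rank, hidden dim, moe_inter_dim

lora_down_B = torch.randn(E, r, H)   # grouped 3-D native LoRA tensors
lora_down_A = torch.randn(E, I, r)

# (E, r, H) -> (r*E, H): a plain reshape folds experts into the rank axis.
peft_lora_A = lora_down_B.reshape(E * r, H)

# (E, I, r) -> (I, r*E): move I to the front, then fold experts into rank.
peft_lora_B = lora_down_A.permute(1, 0, 2).reshape(I, E * r)
```

The gate_up_proj pair follows the same pattern with 2*I and H swapped in, plus the base_layer. key prefix.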
- from_hf(
- hf_state_dict: dict[str, Any],
- device_mesh: Optional[torch.distributed.device_mesh.DeviceMesh] = None,
- **kwargs,
- )
Convert HF checkpoint to native format, handling ParamWrapper LoRA keys.
Before delegating to the parent _from_hf_w_merged_experts (which handles the legacy per-expert LoRA format), this method scans for ParamWrapper-format LoRA keys and converts them back to the native grouped format expected by GroupedExpertsLoRA.
- _convert_paramwrapper_to_native(
- state_dict: dict[str, Any],
- )
Convert PEFT ParamWrapper LoRA keys to native grouped MoE LoRA format.
This is the reverse of _convert_lora_to_paramwrapper: it detects ParamWrapper-format keys and converts them back to the 3-D grouped tensors expected by GroupedExpertsLoRA.
Reverse transforms (down_proj is outer, gate_up_proj is inner):

experts.lora_A.weight (r*E, H) -> (E, r, H) = lora_down_B
experts.lora_B.weight (I, r*E) -> (E, I, r) = lora_down_A
experts.base_layer.lora_A.weight (r*E, 2*I) -> (E, r, 2*I) = lora_gate_and_up_B
experts.base_layer.lora_B.weight (H, r*E) -> (E, H, r) = lora_gate_and_up_A
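Because the fold is pure reshape/permute, the reverse transform must be an exact round trip. A sketch for the down_proj pair, under the same assumed expert-major fold order as above:

```python
import torch

E, r, H, I = 4, 2, 8, 16   # experts, LoRA rank, hidden dim, moe_inter_dim

lora_down_B = torch.randn(E, r, H)
lora_down_A = torch.randn(E, I, r)

# Forward fold (native -> ParamWrapper), fold order assumed.
peft_A = lora_down_B.reshape(E * r, H)                   # (r*E, H)
peft_B = lora_down_A.permute(1, 0, 2).reshape(I, E * r)  # (I, r*E)

# Reverse: unfold the expert axis back out of the rank axis.
native_B = peft_A.reshape(E, r, H)
native_A = peft_B.reshape(I, E, r).permute(1, 0, 2)

# Round trip is lossless.
assert torch.equal(native_B, lora_down_B)
assert torch.equal(native_A, lora_down_A)
```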