nemo_automodel.components.models.qwen3_vl_moe.state_dict_adapter#

Module Contents#

Classes#

Qwen3VLMoeStateDictAdapter

Converts between HF Qwen3-VL-MoE checkpoints and grouped-experts native format.

API#

class nemo_automodel.components.models.qwen3_vl_moe.state_dict_adapter.Qwen3VLMoeStateDictAdapter(
config: Any,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
dtype: torch.dtype = torch.float32,
)#

Bases: nemo_automodel.components.checkpoint.state_dict_adapter.StateDictAdapter

Converts between HF Qwen3-VL-MoE checkpoints and grouped-experts native format.

HF checkpoint keys (already stacked, no .weight suffix):

model.language_model.layers.{L}.mlp.experts.gate_up_proj [n_experts, dim, 2*inter]
model.language_model.layers.{L}.mlp.experts.down_proj [n_experts, inter, dim]

Native format (identical shapes, different key names):

model.language_model.layers.{L}.mlp.experts.gate_and_up_projs
model.language_model.layers.{L}.mlp.experts.down_projs

Loading paths:

DCP path: to_hf renames native→HF, DCP loads into DTensors, from_hf renames HF→native. Tensors are DTensors throughout; only keys are renamed, with no tensor ops.

Init path: from_hf receives plain tensors from safetensors, slices each to the local EP shard, and wraps it in a DTensor via create_dtensor_from_local.
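Both directions of the conversion are pure key mappings over the expert-weight names listed above. A minimal sketch of the suffix swap, assuming a regex over the documented key pattern (the helper names here are illustrative, not the actual implementation):

```python
import re

# Suffix pairs taken from the key lists above: native name -> HF name.
_NATIVE_TO_HF = {
    "gate_and_up_projs": "gate_up_proj",
    "down_projs": "down_proj",
}
_HF_TO_NATIVE = {v: k for k, v in _NATIVE_TO_HF.items()}

# Matches the documented expert key layout; group 2 is the final suffix.
_EXPERT_KEY_RE = re.compile(
    r"^(model\.language_model\.layers\.\d+\.mlp\.experts\.)(\w+)$"
)


def rename_native_to_hf(key: str) -> str:
    """Swap the native expert suffix for the HF one; other keys pass through."""
    m = _EXPERT_KEY_RE.match(key)
    if m and m.group(2) in _NATIVE_TO_HF:
        return m.group(1) + _NATIVE_TO_HF[m.group(2)]
    return key


def rename_hf_to_native(key: str) -> str:
    """Inverse mapping, as used on the from_hf path."""
    m = _EXPERT_KEY_RE.match(key)
    if m and m.group(2) in _HF_TO_NATIVE:
        return m.group(1) + _HF_TO_NATIVE[m.group(2)]
    return key
```

Non-expert keys (embeddings, attention, norms) are left untouched by the mapping, which is why the DCP path involves no tensor operations at all.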

Initialization

to_hf(
state_dict: dict[str, Any],
exclude_key_regex: Optional[str] = None,
quantization: bool = False,
**kwargs,
) dict[str, Any]#

Rename native keys to HF keys. Tensors are passed through as-is (no communication).

from_hf(
hf_state_dict: dict[str, Any],
device_mesh: Optional[torch.distributed.device_mesh.DeviceMesh] = None,
**kwargs,
) dict[str, Any]#

Rename HF keys to native keys.

DTensors (DCP path): rename only, no tensor ops.

Plain tensors (init path): slice to the local EP shard and create a DTensor.
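On the init path, the slice over the leading expert dimension is a contiguous split across EP ranks. A hedged sketch of the shard arithmetic only (function names are assumptions; the real adapter wraps the resulting slice in a DTensor via create_dtensor_from_local):

```python
def local_expert_range(n_experts: int, ep_size: int, ep_rank: int) -> tuple[int, int]:
    """Half-open [start, end) range of expert indices owned by this EP rank.

    Assumes n_experts divides evenly across ranks; uneven splits would need
    extra handling not shown here.
    """
    assert n_experts % ep_size == 0, "uneven expert splits need extra handling"
    per_rank = n_experts // ep_size
    start = ep_rank * per_rank
    return start, start + per_rank


def shard_experts(stacked, ep_size: int, ep_rank: int):
    """Slice a stacked [n_experts, ...] weight to the local EP shard.

    Works on anything indexable along dim 0 (a torch.Tensor in practice).
    """
    start, end = local_expert_range(len(stacked), ep_size, ep_rank)
    return stacked[start:end]
```

For example, with 128 experts over 8 EP ranks, rank 7 owns experts 112..127 of both gate_up_proj and down_proj before the DTensor wrap.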

convert_single_tensor_to_hf(
fqn: str,
tensor: Any,
**kwargs,
) list[tuple[str, Any]]#

Rename a single native key to HF format. Tensor passed through as-is.
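For a single key, the conversion behaves like a one-entry to_hf: the key is renamed and the tensor object is returned untouched. A sketch of the expected result shape, using the suffix pairs from the key lists above (the function name here is illustrative, not the real method):

```python
def convert_single_native_key(fqn: str, tensor):
    """Return [(hf_fqn, tensor)]; only the key changes, the tensor is untouched."""
    # Suffix mapping from the native/HF key lists documented above.
    suffix_map = {
        ".mlp.experts.gate_and_up_projs": ".mlp.experts.gate_up_proj",
        ".mlp.experts.down_projs": ".mlp.experts.down_proj",
    }
    for native_sfx, hf_sfx in suffix_map.items():
        if fqn.endswith(native_sfx):
            return [(fqn[: -len(native_sfx)] + hf_sfx, tensor)]
    # Non-expert keys pass through unchanged.
    return [(fqn, tensor)]
```

Returning a list of (key, tensor) pairs matches the method's documented signature, which allows a converter to emit zero or several HF entries per native key even though this adapter always emits exactly one.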