nemo_automodel.components.models.qwen3_5_moe.state_dict_adapter
State-dict adapter for Qwen3.5-MoE.
HF Qwen3.5-MoE stores expert weights as aggregated 3-D tensors:
    model.language_model.layers.{L}.mlp.experts.gate_up_proj   # [n_experts, 2*moe_inter, hidden]
    model.language_model.layers.{L}.mlp.experts.down_proj      # [n_experts, hidden, moe_inter]
NeMo uses a different naming convention and transposed layout (x @ weight):
    model.language_model.layers.{L}.mlp.experts.gate_and_up_projs   # [n_experts, hidden, 2*moe_inter]
    model.language_model.layers.{L}.mlp.experts.down_projs          # [n_experts, moe_inter, hidden]
Both expert tensors require .transpose(1, 2) when converting between formats.
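The layout change can be illustrated with a minimal sketch. The sizes below are arbitrary, and the call to .contiguous() is an assumption made for illustration; the adapter itself may or may not materialize the transposed tensor.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only; real checkpoints are much larger.
n_experts, hidden, moe_inter = 4, 8, 6

# HF layout, as described above.
hf_gate_up = torch.randn(n_experts, 2 * moe_inter, hidden)
hf_down = torch.randn(n_experts, hidden, moe_inter)

# NeMo layout: swap the last two dims so the forward pass can use x @ weight.
# .contiguous() is an assumption here, not something the docstring states.
nemo_gate_and_up = hf_gate_up.transpose(1, 2).contiguous()
nemo_down = hf_down.transpose(1, 2).contiguous()

# The same transpose converts back to the HF layout.
roundtrip = nemo_gate_and_up.transpose(1, 2)

# For one expert, x @ weight (NeMo) matches the HF-style linear(x, weight).
x = torch.randn(3, hidden)
y_nemo = x @ nemo_gate_and_up[0]
y_hf = F.linear(x, hf_gate_up[0])
```

Because the operation is its own inverse, a single transpose covers both conversion directions.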
Additionally, the shared-expert module name is singular in HF and plural in NeMo:
    HF:   .mlp.shared_expert.{gate,up,down}_proj.weight
    NeMo: .mlp.shared_experts.{gate,up,down}_proj.weight
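The renames described so far can be sketched as a substring mapping. The names HF_TO_NATIVE, NATIVE_TO_HF, and rename_key are hypothetical and chosen for this example; they are not the adapter's actual attributes or methods.

```python
# Hypothetical mapping built from the key names listed above.
HF_TO_NATIVE = {
    ".mlp.shared_expert.": ".mlp.shared_experts.",
    ".mlp.experts.gate_up_proj": ".mlp.experts.gate_and_up_projs",
    ".mlp.experts.down_proj": ".mlp.experts.down_projs",
}
# The reverse direction simply swaps each pair.
NATIVE_TO_HF = {new: old for old, new in HF_TO_NATIVE.items()}


def rename_key(key: str, mapping: dict) -> str:
    """Apply the first matching substring replacement; pass other keys through."""
    for old, new in mapping.items():
        if old in key:
            return key.replace(old, new)
    return key


hf_key = "model.language_model.layers.0.mlp.shared_expert.gate_proj.weight"
native_key = rename_key(hf_key, HF_TO_NATIVE)
attn_key = "model.language_model.layers.0.self_attn.q_proj.weight"
```

The trailing dot in ".mlp.shared_expert." keeps the pattern from matching a key that is already in the plural form.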
All other keys (attention, linear_attn/GatedDeltaNet, norms, embeddings, vision
encoder) pass through unchanged. The HF VLM checkpoint stores the language
model head as model.lm_head while Automodel registers it on the outer model
as lm_head.
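A minimal sketch of the head relocation, assuming the checkpoint key is model.lm_head.weight and that dropping the leading "model." segment is all that is needed. The function names are hypothetical.

```python
def head_hf_to_native(key: str) -> str:
    """Move the LM head from under `model.` to the outer model (assumed key shape)."""
    if key.startswith("model.lm_head."):
        return key[len("model."):]
    return key


def head_native_to_hf(key: str) -> str:
    """Inverse of head_hf_to_native for the same assumed key shape."""
    if key.startswith("lm_head."):
        return "model." + key
    return key
```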
Module Contents
Classes
Functions
API
Bases: StateDictAdapter
Converts between HF Qwen3.5-MoE checkpoints and the NeMo native format.
HF Qwen3.5-MoE stores expert weights as aggregated 3-D tensors:
    model.language_model.layers.{L}.mlp.experts.gate_up_proj   # [n_experts, 2*moe_inter, hidden]
    model.language_model.layers.{L}.mlp.experts.down_proj      # [n_experts, hidden, moe_inter]
NeMo uses a different naming convention and transposed layout (x @ weight):
    model.language_model.layers.{L}.mlp.experts.gate_and_up_projs   # [n_experts, hidden, 2*moe_inter]
    model.language_model.layers.{L}.mlp.experts.down_projs          # [n_experts, moe_inter, hidden]
Both expert tensors require .transpose(1, 2) when converting between formats.
Additionally, the shared-expert module name is singular in HF and plural in NeMo:
    HF:   .mlp.shared_expert.{gate,up,down}_proj.weight
    NeMo: .mlp.shared_experts.{gate,up,down}_proj.weight
Apply key substring mappings to state dict keys.
Rename a single native key to HF format and transpose expert tensors.
Rename HF keys to native keys and transpose expert tensors.
    DTensors (DCP path): rename + transpose, no slicing; DCP handles sharding.
    Plain tensors (init path): slice to local EP shard, transpose, create DTensor.
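The "slice to local EP shard" step on the init path can be sketched as follows. This assumes experts are split into equal contiguous blocks across expert-parallel ranks, which the docstring does not state; local_expert_slice is a hypothetical helper, and DTensor creation is left out.

```python
def local_expert_slice(n_experts: int, ep_size: int, ep_rank: int) -> slice:
    """Return the slice of the expert dimension owned by one EP rank.

    Assumes an even, contiguous partition of experts across ranks.
    """
    if n_experts % ep_size != 0:
        raise ValueError("n_experts must be divisible by ep_size")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank out of range")
    per_rank = n_experts // ep_size
    start = ep_rank * per_rank
    return slice(start, start + per_rank)


# Stand-in for the expert dimension of a full [n_experts, ...] tensor.
expert_ids = list(range(8))
local_ids = expert_ids[local_expert_slice(8, ep_size=4, ep_rank=1)]
```

With a real tensor, the same slice would index dim 0 before the transpose, so each rank only transposes the experts it owns.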
Rename native keys to HF keys and transpose expert tensors. No comms needed.
Route bare GDN fp32 params into the holder used by the native module.
Strip the fp32 holder segment from GDN state-dict keys.
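The two GDN helpers are inverses of each other, which a small sketch can make concrete. Everything named here is hypothetical: the holder segment "_fp32_params", the parameter leaves A_log and dt_bias, and the function names are placeholders for illustration, not values taken from the adapter.

```python
HOLDER = "_fp32_params"            # hypothetical holder segment name
FP32_LEAVES = ("A_log", "dt_bias")  # assumed fp32 GDN parameter names


def add_holder(key: str) -> str:
    """Route a bare GDN fp32 parameter key into the holder segment."""
    head, _, leaf = key.rpartition(".")
    if head.endswith(".linear_attn") and leaf in FP32_LEAVES:
        return f"{head}.{HOLDER}.{leaf}"
    return key


def strip_holder(key: str) -> str:
    """Remove the holder segment so the key matches the bare (HF-style) name."""
    return key.replace(f".{HOLDER}.", ".", 1)


bare = "model.language_model.layers.0.linear_attn.A_log"
held = add_holder(bare)
```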