nemo_automodel.components.models.ling_v2.state_dict_adapter

HF <-> NeMo state-dict adapter for BailingMoeV2 (Ling 2.0).

Handles the rename map between the HuggingFace checkpoint layout

model.word_embeddings.weight
model.layers.{N}.attention.query_key_value.weight      # fused [Q | K | V]
model.layers.{N}.attention.dense.weight
model.layers.{N}.attention.query_layernorm.weight
model.layers.{N}.attention.key_layernorm.weight
model.layers.{N}.mlp.gate.weight
model.layers.{N}.mlp.gate.expert_bias
model.layers.{N}.mlp.experts.{E}.{gate_proj,up_proj,down_proj}.weight
model.layers.{N}.mlp.shared_experts.{gate_proj,up_proj,down_proj}.weight

and the native NeMo layout used by this package

model.embed_tokens.weight
model.layers.{N}.self_attn.{q_proj,k_proj,v_proj,o_proj}.weight
model.layers.{N}.self_attn.{q_norm,k_norm}.weight
model.layers.{N}.mlp.gate.weight
model.layers.{N}.mlp.gate.e_score_correction_bias
model.layers.{N}.mlp.experts.{gate_and_up_projs,down_projs}
model.layers.{N}.mlp.shared_experts.{gate_proj,up_proj,down_proj}.weight

The per-expert grouping is delegated to MoESplitExpertsStateDictMixin; this adapter only normalises the surrounding key names and splits the fused QKV.

Module Contents

Classes

BailingMoeV2StateDictAdapter

State-dict adapter for BailingMoeV2 / Ling 2.0 checkpoints.

Functions

_rename_hf_to_native

_rename_native_to_hf

Data

_RENAME_PAIRS_HF_TO_NATIVE

_LAYER_QKV_RE

API

nemo_automodel.components.models.ling_v2.state_dict_adapter._RENAME_PAIRS_HF_TO_NATIVE: tuple[tuple[str, str], ...]

(('model.word_embeddings.', 'model.embed_tokens.'), ('.attention.dense.', '.self_attn.o_proj.'), ('…

nemo_automodel.components.models.ling_v2.state_dict_adapter._LAYER_QKV_RE

'compile(…)'

nemo_automodel.components.models.ling_v2.state_dict_adapter._rename_hf_to_native(key: str) -> str

nemo_automodel.components.models.ling_v2.state_dict_adapter._rename_native_to_hf(key: str) -> str

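
The two rename helpers are documented only by their signatures. As a rough sketch of how a substring-based rename table could be applied in both directions, the example below uses only the two pairs that are visible in the truncated `_RENAME_PAIRS_HF_TO_NATIVE` value. The names `RENAME_PAIRS`, `rename_hf_to_native`, and `rename_native_to_hf` are illustrative stand-ins, and plain `str.replace` is an assumption; the module's own helpers may order or anchor their replacements differently.

```python
# Illustrative stand-in for the rename table. Only the two pairs visible in the
# truncated value are included; the real table contains more entries.
RENAME_PAIRS: tuple[tuple[str, str], ...] = (
    ("model.word_embeddings.", "model.embed_tokens."),
    (".attention.dense.", ".self_attn.o_proj."),
)


def rename_hf_to_native(key: str) -> str:
    """Replace every HF fragment found in `key` with its native counterpart."""
    for hf_part, native_part in RENAME_PAIRS:
        key = key.replace(hf_part, native_part)
    return key


def rename_native_to_hf(key: str) -> str:
    """Apply the same table in the opposite direction."""
    for hf_part, native_part in RENAME_PAIRS:
        key = key.replace(native_part, hf_part)
    return key


hf_key = "model.layers.3.attention.dense.weight"
native_key = rename_hf_to_native(hf_key)
print(native_key)                                  # model.layers.3.self_attn.o_proj.weight
print(rename_native_to_hf(native_key) == hf_key)   # True
```

Keeping a single table and reading it in both directions means the two mappings cannot drift apart, which is presumably why only the HF-to-native pairs are stored as module data.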
class nemo_automodel.components.models.ling_v2.state_dict_adapter.BailingMoeV2StateDictAdapter(
config: nemo_automodel.components.models.ling_v2.config.BailingMoeV2Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
dtype: torch.dtype = torch.bfloat16,
)

Bases: nemo_automodel.components.moe.state_dict_mixin.MoESplitExpertsStateDictMixin, nemo_automodel.components.checkpoint.state_dict_adapter.StateDictAdapter

State-dict adapter for BailingMoeV2 / Ling 2.0 checkpoints.

Initialization

from_hf(
hf_state_dict: dict[str, Any],
device_mesh: Optional[torch.distributed.device_mesh.DeviceMesh] = None,
**kwargs,
) -> dict[str, Any]

_split_fused_qkv_and_rename(
hf_state_dict: dict[str, Any],
) -> dict[str, Any]

Split each fused query_key_value weight into q/k/v and apply renames.
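
To make the split concrete, here is a minimal sketch rather than the adapter's actual code. It assumes the fused weight stacks Q, K, and V along dim 0 in that order, with `q_rows` and `kv_rows` rows respectively (typically `num_attention_heads * head_dim` and `num_key_value_heads * head_dim`, though that is an assumption about the config). The helper names and `QKV_SUFFIX` are hypothetical. Toy nested lists stand in for tensors; the same slicing works on a `torch.Tensor`.

```python
from typing import Any

# HF key suffix of the fused projection, taken from the layout listing.
QKV_SUFFIX = ".attention.query_key_value.weight"


def split_fused_qkv(fused: Any, q_rows: int, kv_rows: int) -> tuple[Any, Any, Any]:
    """Slice a fused [Q | K | V] weight along dim 0 into its three parts.

    Works on anything that supports len() and row slicing.
    """
    expected = q_rows + 2 * kv_rows
    if len(fused) != expected:
        raise ValueError(f"expected {expected} rows, got {len(fused)}")
    q = fused[:q_rows]
    k = fused[q_rows:q_rows + kv_rows]
    v = fused[q_rows + kv_rows:]
    return q, k, v


def split_qkv_in_state_dict(
    hf_state_dict: dict[str, Any], q_rows: int, kv_rows: int
) -> dict[str, Any]:
    """Replace each fused QKV entry with three native-style projection entries."""
    out: dict[str, Any] = {}
    for key, value in hf_state_dict.items():
        if not key.endswith(QKV_SUFFIX):
            out[key] = value  # the remaining renames are omitted in this sketch
            continue
        layer_prefix = key[: -len(QKV_SUFFIX)]  # e.g. "model.layers.0"
        q, k, v = split_fused_qkv(value, q_rows, kv_rows)
        out[f"{layer_prefix}.self_attn.q_proj.weight"] = q
        out[f"{layer_prefix}.self_attn.k_proj.weight"] = k
        out[f"{layer_prefix}.self_attn.v_proj.weight"] = v
    return out


# Toy fused weight with 8 rows: 4 for Q, 2 for K, 2 for V.
toy = {"model.layers.0.attention.query_key_value.weight": [[i] for i in range(8)]}
native = split_qkv_in_state_dict(toy, q_rows=4, kv_rows=2)
for name, rows in native.items():
    print(name, rows)
```

The row-count check is worth keeping in any real implementation: a mismatch between the configured head counts and the checkpoint would otherwise produce silently misaligned K and V weights.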

to_hf(
state_dict: dict[str, Any],
exclude_key_regex: Optional[str] = None,
quantization: bool = False,
**kwargs,
) -> dict[str, Any]

convert_single_tensor_to_hf(
fqn: str,
tensor: Any,
**kwargs,
) -> list[tuple[str, Any]]

Convert a single native tensor to HuggingFace format.

q_proj / k_proj / v_proj tensors cannot be re-fused without their two siblings, so callers that need the fused HF weight should pass the whole state dict through to_hf instead. This single-tensor path emits a per-projection HF key (not the standard fused query_key_value name) so that the value is not silently dropped by DCP save adapters that walk tensors one by one.
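
The reason re-fusing needs all three siblings can be sketched as follows. This is an illustration under assumptions, not the adapter's `to_hf` implementation: `QKV_RE` is a hypothetical pattern (the module's own `_LAYER_QKV_RE` is not shown on this page), `fuse_qkv` and `concat_rows` are made-up names, and fusion is assumed to be a plain concatenation along dim 0 in Q, K, V order.

```python
import re
from typing import Any, Callable

# Hypothetical pattern for native per-projection keys; not the module's regex.
QKV_RE = re.compile(r"^(model\.layers\.\d+)\.self_attn\.([qkv])_proj\.weight$")


def fuse_qkv(
    state_dict: dict[str, Any], concat: Callable[[list[Any]], Any]
) -> dict[str, Any]:
    """Collect q/k/v per layer and emit one fused HF-style entry per layer."""
    out: dict[str, Any] = {}
    pending: dict[str, dict[str, Any]] = {}  # layer prefix -> parts seen so far
    for key, value in state_dict.items():
        match = QKV_RE.match(key)
        if match is None:
            out[key] = value  # the remaining renames are omitted in this sketch
            continue
        prefix, which = match.group(1), match.group(2)
        parts = pending.setdefault(prefix, {})
        parts[which] = value
        if len(parts) == 3:  # all siblings present, so fusion is possible
            out[f"{prefix}.attention.query_key_value.weight"] = concat(
                [parts["q"], parts["k"], parts["v"]]
            )
            del pending[prefix]
    if pending:
        raise ValueError(f"incomplete q/k/v sets for: {sorted(pending)}")
    return out


def concat_rows(parts: list[Any]) -> Any:
    """Row-wise concatenation for the toy nested lists used below."""
    return [row for part in parts for row in part]


native = {
    "model.layers.0.self_attn.q_proj.weight": [[0], [1]],
    "model.layers.0.self_attn.k_proj.weight": [[2]],
    "model.layers.0.self_attn.v_proj.weight": [[3]],
    "model.layers.0.mlp.gate.weight": [[9]],
}
fused = fuse_qkv(native, concat_rows)
print(fused["model.layers.0.attention.query_key_value.weight"])  # [[0], [1], [2], [3]]
```

With real tensors, `concat` could be `lambda parts: torch.cat(parts, dim=0)`. A converter that sees one tensor at a time has no `pending` state to draw on, which is why the single-tensor path has to fall back to a per-projection key.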