nemo_automodel.components.models.nemotron_v3.state_dict_adapter#

Module Contents#

Classes#

NemotronV3StateDictAdapter

State dict adapter for NemotronV3 models.

Data#

API#

nemo_automodel.components.models.nemotron_v3.state_dict_adapter.logger#

'getLogger(…)'

class nemo_automodel.components.models.nemotron_v3.state_dict_adapter.NemotronV3StateDictAdapter(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
dtype: torch.dtype = torch.bfloat16,
)#

Bases: nemo_automodel.components.moe.state_dict_mixin.MoESplitExpertsStateDictMixin, nemo_automodel.components.checkpoint.state_dict_adapter.StateDictAdapter

State dict adapter for NemotronV3 models.

Converts between HuggingFace checkpoint format and internal NeMo format.

HF format uses the 'backbone' prefix:

  • backbone.embed_tokens.weight

  • backbone.layers.{}.norm.weight

  • backbone.layers.{}.mixer.* (mamba/attention/moe components)

  • backbone.norm_f.weight

  • lm_head.weight

Internal format uses the 'model' prefix:

  • model.embed_tokens.weight

  • model.layers.{}.norm.weight

  • model.layers.{}.mixer.* (mamba/attention/moe components)

  • model.norm.weight

  • lm_head.weight

For MoE layers:

  • HF: split per-expert weights (experts.{}.up_proj.weight, experts.{}.down_proj.weight)

  • Internal: merged expert weights (experts.gate_and_up_projs, experts.down_projs)

NemotronV3 uses ReLU² activation (non-gated), so gate_and_up_projs has shape [n_experts, dim, inter_dim] instead of [n_experts, dim, 2*inter_dim].

Note: NemotronV3 uses 'mixer' instead of 'mlp' in layer paths.
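To make the mapping above concrete, the sketch below pairs a few representative HF keys with their internal counterparts. The literal names follow the layout documented above; the dictionary and comments are illustrative only and are not produced by the adapter itself.

```python
# Illustrative mapping of non-expert keys, following the layout documented above.
# The adapter performs these renames internally; this dict is not part of its API.
hf_to_internal = {
    "backbone.embed_tokens.weight": "model.embed_tokens.weight",
    "backbone.layers.0.norm.weight": "model.layers.0.norm.weight",
    "backbone.norm_f.weight": "model.norm.weight",
    "lm_head.weight": "lm_head.weight",  # unchanged
}

# Expert weights: HF stores one tensor per expert, the internal format groups them.
#   backbone.layers.3.mixer.experts.{e}.up_proj.weight   (one per expert e)
#     -> model.layers.3.mixer.experts.gate_and_up_projs   # [n_experts, dim, inter_dim] (ReLU^2, non-gated)
#   backbone.layers.3.mixer.experts.{e}.down_proj.weight  (one per expert e)
#     -> model.layers.3.mixer.experts.down_projs
```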

Initialization

property _hf_prefix: str#

NemotronV3 HF format uses the 'backbone.' prefix.

property _expert_path_segment: str#

NemotronV3 uses 'mixer.experts' instead of 'mlp.experts'.
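For example, composing these two overrides with the documented layer-path pattern yields HF expert keys of the following form (the indices below are placeholders):

```python
# Hypothetical composition of the two overrides above into an HF expert key.
hf_prefix = "backbone."            # documented value of _hf_prefix
expert_segment = "mixer.experts"   # documented value of _expert_path_segment

layer_idx, expert_idx = 3, 7       # placeholder indices
hf_key = f"{hf_prefix}layers.{layer_idx}.{expert_segment}.{expert_idx}.up_proj.weight"
print(hf_key)  # backbone.layers.3.mixer.experts.7.up_proj.weight
```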

to_hf(
state_dict: dict[str, Any],
exclude_key_regex: Optional[str] = None,
**kwargs,
) → dict[str, Any]#

Convert an internal model state dict to HuggingFace format.

Parameters:
  • state_dict – Internal format state dict

  • exclude_key_regex – Optional regex pattern to exclude keys

  • **kwargs – Additional arguments

Returns:

HuggingFace format state dict
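A minimal usage sketch for exporting to the HF layout. The config, MoE config, backend, and model objects are assumed to have been built elsewhere for a NemotronV3 model, and the exclude_key_regex value is only an illustrative pattern, not a required setting.

```python
# Hedged sketch, not a verbatim recipe: assumes `config`, `moe_config`, and
# `backend` have already been built for a NemotronV3 model, and that `model`
# is the corresponding module whose state dict uses the internal layout.
import torch

from nemo_automodel.components.models.nemotron_v3.state_dict_adapter import (
    NemotronV3StateDictAdapter,
)

adapter = NemotronV3StateDictAdapter(
    config=config,
    moe_config=moe_config,
    backend=backend,
    dtype=torch.bfloat16,
)

# Convert the internal state dict to HF layout; the regex below is only an
# example of how exclude_key_regex can drop unwanted keys.
hf_state_dict = adapter.to_hf(
    model.state_dict(),
    exclude_key_regex=r".*_extra_state$",  # assumption: illustrative pattern
)
```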

from_hf(
hf_state_dict: dict[str, Any],
device_mesh: Optional[torch.distributed.device_mesh.DeviceMesh] = None,
**kwargs,
) → dict[str, Any]#

Convert an HF checkpoint to the internal format.

  • Rename backbone → model

  • Rename norm_f → norm

  • Aggregate per-expert weights into grouped tensors

  • If device_mesh is provided, load only the experts needed for the current rank

Parameters:
  • hf_state_dict – HuggingFace format state dict

  • device_mesh – Optional device mesh for distributed expert loading

  • **kwargs – Additional arguments

Returns:

Internal format state dict
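A hedged sketch of the reverse direction, reusing the adapter and model from the to_hf example above. load_hf_checkpoint is a placeholder for whatever mechanism yields a flat name-to-tensor dict from the HF checkpoint files; it is not provided by this module.

```python
# Hedged sketch, reusing `adapter` and `model` from the to_hf example above.
# `load_hf_checkpoint` is a placeholder, not a function provided by this module.
hf_state_dict = load_hf_checkpoint("path/to/nemotron_v3_checkpoint")  # placeholder

# Optionally pass an expert-parallel DeviceMesh so each rank loads only its
# own experts, e.g.:
#   mesh = torch.distributed.device_mesh.init_device_mesh(
#       "cuda", (ep_size,), mesh_dim_names=("ep",)
#   )
internal_state_dict = adapter.from_hf(hf_state_dict, device_mesh=None)

model.load_state_dict(internal_state_dict, strict=False)  # strict=False is an assumption
```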

convert_single_tensor_to_hf(
fqn: str,
tensor: Any,
**kwargs,
) → list[tuple[str, Any]]#

Convert a single tensor from internal format to HuggingFace format.

Parameters:
  • fqn – Fully qualified name of the tensor in internal format

  • tensor – The tensor to convert

  • **kwargs – Additional arguments for conversion

Returns:

List of (fqn, tensor) tuples in HuggingFace format
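Because one internal tensor can expand into several HF tensors (for example, a grouped expert tensor splitting back into per-expert weights), the method returns a list of (fqn, tensor) pairs. Below is a hedged sketch of streaming the conversion tensor by tensor, reusing the adapter and model from the earlier sketch; the loop is illustrative, not the library's own export path.

```python
# Hedged sketch: convert tensor-by-tensor instead of calling to_hf on the whole
# state dict, reusing `adapter` and `model` from the construction sketch above.
hf_pairs = {}
for fqn, tensor in model.state_dict().items():
    for hf_fqn, hf_tensor in adapter.convert_single_tensor_to_hf(fqn, tensor):
        hf_pairs[hf_fqn] = hf_tensor
```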