nemo_automodel.components.models.nemotron_v3.state_dict_adapter

Module Contents

Classes

Name | Description
NemotronV3StateDictAdapter | State dict adapter for NemotronV3 models.

Data

logger

API

class nemo_automodel.components.models.nemotron_v3.state_dict_adapter.NemotronV3StateDictAdapter(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
dtype: torch.dtype = torch.bfloat16
)

Bases: MoESplitExpertsStateDictMixin, StateDictAdapter

State dict adapter for NemotronV3 models.

Converts between HuggingFace checkpoint format and internal NeMo format.

HF format uses the ‘backbone’ prefix:

  • backbone.embed_tokens.weight
  • backbone.layers.{}.norm.weight
  • backbone.layers.{}.mixer.* (mamba/attention/moe components)
  • backbone.norm_f.weight
  • lm_head.weight

Internal format uses the ‘model’ prefix:

  • model.embed_tokens.weight
  • model.layers.{}.norm.weight
  • model.layers.{}.mixer.* (mamba/attention/moe components)
  • model.norm.weight
  • lm_head.weight
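The two key layouts above differ only in the top-level prefix and the name of the final norm. A minimal sketch of that mapping follows; the helper names `hf_key_to_internal` and `internal_key_to_hf` are hypothetical and are not part of the adapter's API, and the sketch ignores expert weights entirely.

```python
def hf_key_to_internal(key: str) -> str:
    """Map an HF-style key to the internal naming (illustrative helper only)."""
    # 'backbone.' prefix becomes 'model.'; lm_head.weight is left untouched.
    if key.startswith("backbone."):
        key = "model." + key[len("backbone."):]
    # The final norm is 'norm_f' in HF and 'norm' internally.
    # Per-layer norms (model.layers.{}.norm.weight) keep their name.
    if key == "model.norm_f.weight":
        key = "model.norm.weight"
    return key


def internal_key_to_hf(key: str) -> str:
    """Inverse of hf_key_to_internal (illustrative helper only)."""
    if key == "model.norm.weight":
        return "backbone.norm_f.weight"
    if key.startswith("model."):
        return "backbone." + key[len("model."):]
    return key
```

Matching the final norm by its full key, rather than by substring, avoids accidentally touching the per-layer `norm.weight` entries.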

NemotronV3 uses ReLU² activation (non-gated), so gate_and_up_projs has shape [n_experts, dim, inter_dim] instead of [n_experts, dim, 2*inter_dim].
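To make the shape difference concrete, here is a small sketch. It assumes the usual definition of squared ReLU, `max(x, 0) ** 2`, and that a gated MLP packs the gate and up projections side by side in the last dimension; the helper `gate_and_up_projs_shape` is hypothetical and exists only for illustration.

```python
def relu2(x: float) -> float:
    """Squared ReLU applied to a scalar: max(x, 0) ** 2."""
    return max(x, 0.0) ** 2


def gate_and_up_projs_shape(
    n_experts: int, dim: int, inter_dim: int, gated: bool
) -> tuple[int, int, int]:
    """Expected shape of the grouped expert projection (illustrative helper)."""
    # A gated MLP stores gate and up projections together, doubling the width.
    # A non-gated ReLU^2 MLP only needs the up projection.
    width = 2 * inter_dim if gated else inter_dim
    return (n_experts, dim, width)
```

With `gated=False`, which is the NemotronV3 case described above, the last dimension is `inter_dim` rather than `2 * inter_dim`.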

Note: NemotronV3 uses ‘mixer’ instead of ‘mlp’ in layer paths.

_expert_path_segment
str

NemotronV3 uses ‘mixer.experts’ instead of ‘mlp.experts’.

_hf_prefix
str

NemotronV3 HF format uses the ‘backbone.’ prefix.

from_hf_map

nemo_automodel.components.models.nemotron_v3.state_dict_adapter.NemotronV3StateDictAdapter.convert_single_tensor_to_hf(
fqn: str,
tensor: typing.Any,
**kwargs
) -> list[tuple[str, typing.Any]]

Convert a single tensor from internal format to HuggingFace format.

Parameters:

fqn
str

Fully qualified name of the tensor in internal format

tensor
Any

The tensor to convert

**kwargs
Defaults to {}

Additional arguments for conversion

Returns: list[tuple[str, Any]]

List of (fqn, tensor) tuples in HuggingFace format
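The return type is a list because one internal tensor may correspond to several HF tensors, for example when a grouped expert tensor is split back into per-expert entries. The following is a rough sketch of that idea, not the adapter's actual logic: the per-expert HF key name (`experts.{e}.up_proj.weight`) is an assumption, nested lists stand in for tensors, and any transposition or dtype handling the real method may perform is omitted.

```python
from typing import Any


def convert_single_tensor_to_hf_sketch(fqn: str, tensor: Any) -> list[tuple[str, Any]]:
    """Illustrative stand-in; key names for expert weights are assumptions."""
    # Rename the top-level prefix and the final norm.
    hf_fqn = "backbone." + fqn[len("model."):] if fqn.startswith("model.") else fqn
    if hf_fqn == "backbone.norm.weight":
        hf_fqn = "backbone.norm_f.weight"

    # A grouped expert tensor is indexed by expert along its first dimension,
    # so it fans out into one (name, tensor) pair per expert.
    grouped_suffix = ".mixer.experts.gate_and_up_projs"
    if hf_fqn.endswith(grouped_suffix):
        base = hf_fqn[: -len("gate_and_up_projs")]  # keeps the trailing 'experts.'
        return [(f"{base}{e}.up_proj.weight", w) for e, w in enumerate(tensor)]

    # Everything else maps one-to-one.
    return [(hf_fqn, tensor)]
```

Returning a list in both branches keeps the caller simple: it can extend the output dict with whatever pairs come back without checking which case applied.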

nemo_automodel.components.models.nemotron_v3.state_dict_adapter.NemotronV3StateDictAdapter.from_hf(
hf_state_dict: dict[str, typing.Any],
device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None,
**kwargs
) -> dict[str, typing.Any]

Convert HF checkpoint to internal format.

  • Rename backbone → model
  • Rename norm_f → norm
  • Aggregate per-expert weights into grouped tensors
  • If device_mesh is provided, only load experts needed for the current rank
  • Process MTP keys (mtp.layers.{i}.*) separately, reusing the same MoE expert-merge logic for the MoE sublayer of each MTP depth.
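The steps above can be sketched in plain Python. This is a simplified illustration under several assumptions: per-expert HF keys are assumed to look like `...mixer.experts.{idx}.<name>`, the grouped internal key is formed by dropping the expert index, Python lists stand in for stacked tensors, and a plain `range` stands in for the set of experts owned by the current rank (the real method derives this from `device_mesh`). MTP keys receive no special treatment here.

```python
import re
from typing import Any, Optional

# Assumed layout of a per-expert key: <prefix>.mixer.experts.<idx>.<name>
EXPERT_KEY = re.compile(r"^(?P<prefix>.*\.mixer\.experts\.)(?P<idx>\d+)\.(?P<name>.+)$")


def from_hf_sketch(
    hf_state_dict: dict[str, Any],
    expert_range: Optional[range] = None,  # hypothetical stand-in for device_mesh
) -> dict[str, Any]:
    out: dict[str, Any] = {}
    grouped: dict[str, dict[int, Any]] = {}  # grouped key -> {expert index: weight}

    for key, value in hf_state_dict.items():
        # Rename backbone -> model and norm_f -> norm.
        if key.startswith("backbone."):
            key = "model." + key[len("backbone."):]
        if key == "model.norm_f.weight":
            key = "model.norm.weight"

        match = EXPERT_KEY.match(key)
        if match is None:
            out[key] = value
            continue

        # Skip experts that do not belong to the current rank.
        idx = int(match.group("idx"))
        if expert_range is not None and idx not in expert_range:
            continue
        grouped.setdefault(match.group("prefix") + match.group("name"), {})[idx] = value

    # Aggregate per-expert weights in expert order (a real adapter would stack tensors).
    for key, per_expert in grouped.items():
        out[key] = [per_expert[i] for i in sorted(per_expert)]
    return out
```

Filtering by expert index before aggregation means a rank never materializes weights for experts it does not own, which is the point of passing a device mesh.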

Parameters:

hf_state_dict
dict[str, Any]

HuggingFace format state dict

device_mesh
Optional[DeviceMesh]
Defaults to None

Optional device mesh for distributed expert loading

**kwargs
Defaults to {}

Additional arguments

Returns: dict[str, Any]

Internal format state dict

nemo_automodel.components.models.nemotron_v3.state_dict_adapter.NemotronV3StateDictAdapter.to_hf(
state_dict: dict[str, typing.Any],
exclude_key_regex: typing.Optional[str] = None,
**kwargs
) -> dict[str, typing.Any]

Convert from internal model state dict to HuggingFace format.

Parameters:

state_dict
dict[str, Any]

Internal format state dict

exclude_key_regex
Optional[str]
Defaults to None

Optional regex pattern to exclude keys

**kwargs
Defaults to {}

Additional arguments

Returns: dict[str, Any]

HuggingFace format state dict
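As an illustration of how an `exclude_key_regex` argument could be applied, the sketch below filters a dict of keys with Python's `re` module. Whether the adapter matches from the start of the key (`re.match`), anywhere in it (`re.search`), or against internal versus HF key names is not stated here, so the choice of `re.match` on the given keys is an assumption, and `drop_excluded_keys` is a hypothetical helper.

```python
import re
from typing import Any, Optional


def drop_excluded_keys(
    state_dict: dict[str, Any], exclude_key_regex: Optional[str] = None
) -> dict[str, Any]:
    """Return a copy of state_dict without keys matching the pattern (illustrative)."""
    if exclude_key_regex is None:
        return dict(state_dict)
    pattern = re.compile(exclude_key_regex)
    # Assumption: the pattern is anchored at the start of the key, as re.match does.
    return {k: v for k, v in state_dict.items() if pattern.match(k) is None}
```

For example, a pattern such as `r"lm_head\..*"` would drop `lm_head.weight` while leaving every other key in place.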

nemo_automodel.components.models.nemotron_v3.state_dict_adapter.logger = logging.getLogger(__name__)