nemo_automodel.components.models.nemotron_v3.state_dict_adapter

Module Contents

Classes

Name	Description
`NemotronV3StateDictAdapter`	State dict adapter for NemotronV3 models.

Data

logger

API

class nemo_automodel.components.models.nemotron_v3.state_dict_adapter.NemotronV3StateDictAdapter(
    config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    dtype: torch.dtype = torch.bfloat16
)

Bases: MoESplitExpertsStateDictMixin, StateDictAdapter

State dict adapter for NemotronV3 models.

Converts between HuggingFace checkpoint format and internal NeMo format.

HF format uses ‘backbone’ prefix:

backbone.embed_tokens.weight
backbone.layers.{}.norm.weight
backbone.layers.{}.mixer.* (mamba/attention/moe components)
backbone.norm_f.weight
lm_head.weight

Internal format uses ‘model’ prefix:

model.embed_tokens.weight
model.layers.{}.norm.weight
model.layers.{}.mixer.* (mamba/attention/moe components)
model.norm.weight
lm_head.weight
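
The key mapping listed above can be sketched in a few lines. This is a minimal illustration, not the adapter's actual code: the helper names `hf_key_to_internal` and `internal_key_to_hf` are hypothetical, and only the top-level renames stated above (`backbone.` to `model.`, `norm_f` to `norm`) are covered.

```python
HF_PREFIX = "backbone."     # HF checkpoint prefix, as described above
INTERNAL_PREFIX = "model."  # internal prefix, as described above


def hf_key_to_internal(key: str) -> str:
    """Map one HF key to its internal name (hypothetical helper)."""
    if not key.startswith(HF_PREFIX):
        return key  # e.g. lm_head.weight is identical in both formats
    rest = key[len(HF_PREFIX):]
    if rest == "norm_f.weight":
        rest = "norm.weight"  # final norm only; per-layer norms keep their name
    return INTERNAL_PREFIX + rest


def internal_key_to_hf(key: str) -> str:
    """Inverse of hf_key_to_internal (hypothetical helper)."""
    if not key.startswith(INTERNAL_PREFIX):
        return key
    rest = key[len(INTERNAL_PREFIX):]
    if rest == "norm.weight":
        rest = "norm_f.weight"
    return HF_PREFIX + rest
```

Per-layer keys such as `backbone.layers.{}.norm.weight` only change prefix, because the `norm_f` rename is matched against the full top-level key.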

NemotronV3 uses ReLU² activation (non-gated), so gate_and_up_projs has shape [n_experts, dim, inter_dim] instead of [n_experts, dim, 2*inter_dim].
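
To make the shape difference concrete, the sketch below computes the grouped projection shape for gated and non-gated activations. The function names and the example sizes are illustrative assumptions, not part of the library.

```python
def relu_squared(x: float) -> float:
    """ReLU²: clamp negatives to zero, then square."""
    return max(x, 0.0) ** 2


def grouped_up_proj_shape(
    n_experts: int, dim: int, inter_dim: int, gated: bool
) -> tuple[int, int, int]:
    """Shape of the grouped gate_and_up_projs tensor (hypothetical helper).

    A gated activation fuses a gate and an up projection along the last
    axis; a non-gated activation such as ReLU² needs only the up projection.
    """
    width = 2 * inter_dim if gated else inter_dim
    return (n_experts, dim, width)


# Example sizes are arbitrary and chosen only for illustration.
print(grouped_up_proj_shape(8, 16, 32, gated=False))  # NemotronV3-style, non-gated
print(grouped_up_proj_shape(8, 16, 32, gated=True))   # gated layout, for comparison
```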

Note: NemotronV3 uses ‘mixer’ instead of ‘mlp’ in layer paths.

_expert_path_segment

str

NemotronV3 uses ‘mixer.experts’ instead of ‘mlp.experts’.

_hf_prefix

str

NemotronV3 HF format uses ‘backbone.’ prefix.

from_hf_map

nemo_automodel.components.models.nemotron_v3.state_dict_adapter.NemotronV3StateDictAdapter.convert_single_tensor_to_hf(
    fqn: str,
    tensor: typing.Any,
    **kwargs
) -> list[tuple[str, typing.Any]]

Convert a single tensor from internal format to HuggingFace format.

Parameters:

fqn

str

Fully qualified name of the tensor in internal format

tensor

Any

The tensor to convert

**kwargs

Additional arguments for conversion

Returns: list[tuple[str, Any]]

List of (fqn, tensor) tuples in HuggingFace format
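
The return type is a list because one internal tensor may map to several HuggingFace entries, for example a grouped expert tensor expanding into one entry per expert. The sketch below illustrates that idea with plain Python lists standing in for tensors. The function name, the per-expert key suffix `up_proj.weight`, and the omission of any transpose or dtype handling are all assumptions; the real method may differ.

```python
from typing import Any

EXPERT_SEGMENT = "mixer.experts"  # NemotronV3 expert path segment


def split_grouped_experts(fqn: str, tensor: Any) -> list[tuple[str, Any]]:
    """Hypothetical stand-in for a single-tensor internal -> HF conversion."""
    # Rename the internal 'model.' prefix to the HF 'backbone.' prefix.
    if fqn.startswith("model."):
        hf_fqn = "backbone." + fqn[len("model."):]
    else:
        hf_fqn = fqn

    grouped_suffix = f".{EXPERT_SEGMENT}.gate_and_up_projs"
    if not hf_fqn.endswith(grouped_suffix):
        # Ordinary tensors map one-to-one.
        return [(hf_fqn, tensor)]

    # Grouped expert tensor: emit one (fqn, tensor) pair per expert,
    # indexing along the leading n_experts axis.
    base = hf_fqn[: -len(".gate_and_up_projs")]
    return [
        (f"{base}.{i}.up_proj.weight", expert)  # assumed HF key suffix
        for i, expert in enumerate(tensor)
    ]
```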

nemo_automodel.components.models.nemotron_v3.state_dict_adapter.NemotronV3StateDictAdapter.from_hf(
    hf_state_dict: dict[str, typing.Any],
    device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None,
    **kwargs
) -> dict[str, typing.Any]

Convert HF checkpoint to internal format.

Rename backbone → model
Rename norm_f → norm
Aggregate per-expert weights into grouped tensors
If device_mesh is provided, only load experts needed for the current rank
Process MTP keys (mtp.layers.{i}.*) separately, reusing the same MoE expert-merge logic for the MoE sublayer of each MTP depth.
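
The expert aggregation and per-rank filtering steps above can be sketched as follows, with Python lists standing in for stacked tensors. This is an illustration under assumptions: the per-expert key layout (`...mixer.experts.{j}.up_proj.weight`), the grouped name `down_projs`, and the `expert_range` argument (a stand-in for whatever the device mesh implies for the current rank) are hypothetical. Prefix renaming and MTP handling are left out.

```python
import re
from typing import Any, Optional

# Assumed per-expert HF key layout; only 'mixer.experts' comes from the text above.
EXPERT_KEY = re.compile(
    r"^(?P<base>.*\.mixer\.experts)\.(?P<idx>\d+)\.(?P<proj>up_proj|down_proj)\.weight$"
)
# 'gate_and_up_projs' is named in the class description; 'down_projs' is assumed.
GROUPED_NAME = {"up_proj": "gate_and_up_projs", "down_proj": "down_projs"}


def merge_experts(
    hf_state_dict: dict[str, Any],
    expert_range: Optional[range] = None,
) -> dict[str, Any]:
    """Group per-expert weights by layer and projection (hypothetical helper)."""
    merged: dict[str, Any] = {}
    per_expert: dict[str, dict[int, Any]] = {}

    for key, value in hf_state_dict.items():
        m = EXPERT_KEY.match(key)
        if m is None:
            merged[key] = value  # non-expert tensors pass through unchanged
            continue
        idx = int(m["idx"])
        if expert_range is not None and idx not in expert_range:
            continue  # expert belongs to another rank, skip loading it
        grouped_key = f"{m['base']}.{GROUPED_NAME[m['proj']]}"
        per_expert.setdefault(grouped_key, {})[idx] = value

    # Order experts by index so the grouped tensor has a stable layout.
    for grouped_key, by_idx in per_expert.items():
        merged[grouped_key] = [by_idx[i] for i in sorted(by_idx)]
    return merged
```

Passing `expert_range=range(1, 2)` keeps only expert 1, which mirrors the idea of a rank loading just its own experts.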

Parameters:

hf_state_dict

dict[str, Any]

HuggingFace format state dict

device_mesh

Optional[DeviceMesh] (defaults to None)

Optional device mesh for distributed expert loading

**kwargs

Additional arguments

Returns: dict[str, Any]

Internal format state dict

nemo_automodel.components.models.nemotron_v3.state_dict_adapter.NemotronV3StateDictAdapter.to_hf(
    state_dict: dict[str, typing.Any],
    exclude_key_regex: typing.Optional[str] = None,
    **kwargs
) -> dict[str, typing.Any]

Convert from internal model state dict to HuggingFace format.

Parameters:

state_dict

dict[str, Any]

Internal format state dict

exclude_key_regex

Optional[str] (defaults to None)

Optional regex pattern to exclude keys

**kwargs

Additional arguments

Returns: dict[str, Any]

HuggingFace format state dict
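
As an illustration of how an exclusion pattern like `exclude_key_regex` might be applied, the sketch below drops every key the pattern matches. The helper name is hypothetical, and matching with `re.match` (anchored at the start of the key) is an assumption; the adapter may apply the pattern differently.

```python
import re
from typing import Any, Optional


def filter_keys(
    state_dict: dict[str, Any],
    exclude_key_regex: Optional[str] = None,
) -> dict[str, Any]:
    """Return a copy of state_dict without keys matching the pattern."""
    if exclude_key_regex is None:
        return dict(state_dict)  # nothing to exclude
    pattern = re.compile(exclude_key_regex)
    # Assumption: a key is excluded when the pattern matches from its start.
    return {k: v for k, v in state_dict.items() if not pattern.match(k)}
```

For example, a pattern such as `r"lm_head\..*"` would drop `lm_head.weight` while keeping all `backbone.*` keys.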

nemo_automodel.components.models.nemotron_v3.state_dict_adapter.logger = logging.getLogger(__name__)