nemo_automodel.components.models.nemotron_v3.state_dict_adapter#

Module Contents#

Classes#

NemotronV3StateDictAdapter

State dict adapter for NemotronV3 models.

Data#

API#

nemo_automodel.components.models.nemotron_v3.state_dict_adapter.logger#

'getLogger(…)'

class nemo_automodel.components.models.nemotron_v3.state_dict_adapter.NemotronV3StateDictAdapter(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
dtype: torch.dtype = torch.bfloat16,
)#

Bases: nemo_automodel.components.moe.state_dict_mixin.MoESplitExpertsStateDictMixin, nemo_automodel.components.checkpoint.state_dict_adapter.StateDictAdapter

State dict adapter for NemotronV3 models.

Converts between HuggingFace checkpoint format and internal NeMo format.

HF format uses the 'backbone' prefix:

  • backbone.embed_tokens.weight

  • backbone.layers.{}.norm.weight

  • backbone.layers.{}.mixer.* (mamba/attention/moe components)

  • backbone.norm_f.weight

  • lm_head.weight

Internal format uses the 'model' prefix:

  • model.embed_tokens.weight

  • model.layers.{}.norm.weight

  • model.layers.{}.mixer.* (mamba/attention/moe components)

  • model.norm.weight

  • lm_head.weight

For MoE layers:

  • HF: split per-expert weights (experts.{}.up_proj.weight, experts.{}.down_proj.weight)

  • Internal: merged expert weights (experts.gate_and_up_projs, experts.down_projs)

NemotronV3 uses ReLU² activation (non-gated), so gate_and_up_projs has shape [n_experts, dim, inter_dim] instead of [n_experts, dim, 2*inter_dim].

Note: NemotronV3 uses 'mixer' instead of 'mlp' in layer paths.
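To make the mapping above concrete, the sketch below pairs a few representative HF keys with their internal counterparts. The literal names follow the layout documented above; the dictionary and comments are illustrative only and are not produced by the adapter itself.

```python
# Illustrative mapping of non-expert keys, following the layout documented above.
# The adapter performs these renames internally; this dict is not part of its API.
hf_to_internal = {
    "backbone.embed_tokens.weight": "model.embed_tokens.weight",
    "backbone.layers.0.norm.weight": "model.layers.0.norm.weight",
    "backbone.norm_f.weight": "model.norm.weight",
    "lm_head.weight": "lm_head.weight",  # unchanged
}

# Expert weights: HF stores one tensor per expert, the internal format groups them.
#   backbone.layers.3.mixer.experts.{e}.up_proj.weight   (one per expert e)
#     -> model.layers.3.mixer.experts.gate_and_up_projs   # [n_experts, dim, inter_dim] (ReLU^2, non-gated)
#   backbone.layers.3.mixer.experts.{e}.down_proj.weight  (one per expert e)
#     -> model.layers.3.mixer.experts.down_projs
```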

Initialization

property _hf_prefix: str#

NemotronV3 HF format uses the 'backbone.' prefix.

property _expert_path_segment: str#

NemotronV3 uses 'mixer.experts' instead of 'mlp.experts'.
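For example, composing these two overrides with the documented layer-path pattern yields HF expert keys of the following form (the indices below are placeholders):

```python
# Hypothetical composition of the two overrides above into an HF expert key.
hf_prefix = "backbone."            # documented value of _hf_prefix
expert_segment = "mixer.experts"   # documented value of _expert_path_segment

layer_idx, expert_idx = 3, 7       # placeholder indices
hf_key = f"{hf_prefix}layers.{layer_idx}.{expert_segment}.{expert_idx}.up_proj.weight"
print(hf_key)  # backbone.layers.3.mixer.experts.7.up_proj.weight
```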

to_hf(
state_dict: dict[str, Any],
exclude_key_regex: Optional[str] = None,
**kwargs,
) → dict[str, Any]#

Convert an internal model state dict to HuggingFace format.

Parameters:
  • state_dict – Internal format state dict

  • exclude_key_regex – Optional regex pattern to exclude keys

  • **kwargs – Additional arguments

Returns:

HuggingFace format state dict
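A minimal usage sketch for exporting to the HF layout. The config, MoE config, backend, and model objects are assumed to have been built elsewhere for a NemotronV3 model, and the exclude_key_regex value is only an illustrative pattern, not a required setting.

```python
# Hedged sketch, not a verbatim recipe: assumes `config`, `moe_config`, and
# `backend` have already been built for a NemotronV3 model, and that `model`
# is the corresponding module whose state dict uses the internal layout.
import torch

from nemo_automodel.components.models.nemotron_v3.state_dict_adapter import (
    NemotronV3StateDictAdapter,
)

adapter = NemotronV3StateDictAdapter(
    config=config,
    moe_config=moe_config,
    backend=backend,
    dtype=torch.bfloat16,
)

# Convert the internal state dict to HF layout; the regex below is only an
# example of how exclude_key_regex can drop unwanted keys.
hf_state_dict = adapter.to_hf(
    model.state_dict(),
    exclude_key_regex=r".*_extra_state$",  # assumption: illustrative pattern
)
```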

from_hf(
hf_state_dict: dict[str, Any],
device_mesh: Optional[torch.distributed.device_mesh.DeviceMesh] = None,
**kwargs,
) → dict[str, Any]#

Convert an HF checkpoint to the internal format.

  • Rename backbone → model

  • Rename norm_f → norm

  • Aggregate per-expert weights into grouped tensors

  • If device_mesh is provided, load only the experts needed for the current rank

Parameters:
  • hf_state_dict – HuggingFace format state dict

  • device_mesh – Optional device mesh for distributed expert loading

  • **kwargs – Additional arguments

Returns:

Internal format state dict
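A hedged sketch of the reverse direction, reusing the adapter and model from the to_hf example above. load_hf_checkpoint is a placeholder for whatever mechanism yields a flat name-to-tensor dict from the HF checkpoint files; it is not provided by this module.

```python
# Hedged sketch, reusing `adapter` and `model` from the to_hf example above.
# `load_hf_checkpoint` is a placeholder, not a function provided by this module.
hf_state_dict = load_hf_checkpoint("path/to/nemotron_v3_checkpoint")  # placeholder

# Optionally pass an expert-parallel DeviceMesh so each rank loads only its
# own experts, e.g.:
#   mesh = torch.distributed.device_mesh.init_device_mesh(
#       "cuda", (ep_size,), mesh_dim_names=("ep",)
#   )
internal_state_dict = adapter.from_hf(hf_state_dict, device_mesh=None)

model.load_state_dict(internal_state_dict, strict=False)  # strict=False is an assumption
```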

convert_single_tensor_to_hf(
fqn: str,
tensor: Any,
**kwargs,
) → list[tuple[str, Any]]#

Convert a single tensor from internal format to HuggingFace format.

Parameters:
  • fqn – Fully qualified name of the tensor in internal format

  • tensor – The tensor to convert

  • **kwargs – Additional arguments for conversion

Returns:

List of (fqn, tensor) tuples in HuggingFace format
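Because one internal tensor can expand into several HF tensors (for example, a grouped expert tensor splitting back into per-expert weights), the method returns a list of (fqn, tensor) pairs. Below is a hedged sketch of streaming the conversion tensor by tensor, reusing the adapter and model from the earlier sketch; the loop is illustrative, not the library's own export path.

```python
# Hedged sketch: convert tensor-by-tensor instead of calling to_hf on the whole
# state dict, reusing `adapter` and `model` from the construction sketch above.
hf_pairs = {}
for fqn, tensor in model.state_dict().items():
    for hf_fqn, hf_tensor in adapter.convert_single_tensor_to_hf(fqn, tensor):
        hf_pairs[hf_fqn] = hf_tensor
```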