nemo_automodel.components.models.nemotron_v3.state_dict_adapter#
Module Contents#
Classes#
NemotronV3StateDictAdapter | State dict adapter for NemotronV3 models.
Data#
API#
- nemo_automodel.components.models.nemotron_v3.state_dict_adapter.logger#
'getLogger(…)'
- class nemo_automodel.components.models.nemotron_v3.state_dict_adapter.NemotronV3StateDictAdapter(
- config,
- moe_config: nemo_automodel.components.moe.config.MoEConfig,
- backend: nemo_automodel.components.models.common.BackendConfig,
- dtype: torch.dtype = torch.bfloat16,
)
Bases:
nemo_automodel.components.moe.state_dict_mixin.MoESplitExpertsStateDictMixin, nemo_automodel.components.checkpoint.state_dict_adapter.StateDictAdapter
State dict adapter for NemotronV3 models.
Converts between HuggingFace checkpoint format and internal NeMo format.
HF format uses 'backbone' prefix:
- backbone.embed_tokens.weight
- backbone.layers.{}.norm.weight
- backbone.layers.{}.mixer.* (mamba/attention/moe components)
- backbone.norm_f.weight
- lm_head.weight
Internal format uses 'model' prefix:
- model.embed_tokens.weight
- model.layers.{}.norm.weight
- model.layers.{}.mixer.* (mamba/attention/moe components)
- model.norm.weight
- lm_head.weight
For MoE layers:
- HF: Split per-expert weights (experts.{}.up_proj.weight, experts.{}.down_proj.weight)
- Internal: Merged expert weights (experts.gate_and_up_projs, experts.down_projs)
NemotronV3 uses ReLU² activation (non-gated), so gate_and_up_projs has shape [n_experts, dim, inter_dim] instead of [n_experts, dim, 2*inter_dim].
Note: NemotronV3 uses 'mixer' instead of 'mlp' in layer paths.
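The key mapping and expert merging described above can be summarized with a short, self-contained sketch. This is illustrative only, not the adapter's implementation; the helper name `_rename_hf_key` and the transpose layout of the merged tensor are assumptions.

```python
import torch

def _rename_hf_key(hf_key: str) -> str:
    """Map an HF 'backbone.' key to the internal 'model.' namespace (hypothetical helper)."""
    key = hf_key.replace("backbone.", "model.", 1)
    return key.replace("norm_f.weight", "norm.weight")

assert _rename_hf_key("backbone.embed_tokens.weight") == "model.embed_tokens.weight"
assert _rename_hf_key("backbone.norm_f.weight") == "model.norm.weight"

# Per-expert HF weights are merged into one grouped tensor. Because NemotronV3's
# experts use a non-gated ReLU^2 MLP, the merged up-projection has shape
# [n_experts, dim, inter_dim] rather than [n_experts, dim, 2 * inter_dim].
n_experts, dim, inter_dim = 4, 8, 16
hf_up_projs = [torch.randn(inter_dim, dim) for _ in range(n_experts)]  # experts.{i}.up_proj.weight
gate_and_up_projs = torch.stack([w.t() for w in hf_up_projs])          # assumed [n_experts, dim, inter_dim] layout
assert gate_and_up_projs.shape == (n_experts, dim, inter_dim)
```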
Initialization
- property _hf_prefix: str#
NemotronV3 HF format uses 'backbone.' prefix.
- property _expert_path_segment: str#
NemotronV3 uses 'mixer.experts' instead of 'mlp.experts'.
- to_hf(
- state_dict: dict[str, Any],
- exclude_key_regex: Optional[str] = None,
- **kwargs,
)
Convert from internal model state dict to HuggingFace format.
- Parameters:
state_dict – Internal format state dict
exclude_key_regex – Optional regex pattern to exclude keys
**kwargs – Additional arguments
- Returns:
HuggingFace format state dict
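A hedged usage sketch for to_hf: the adapter and model objects are assumed to be constructed elsewhere, only the documented signature is used, and the exclusion pattern is purely illustrative.

```python
from typing import Any

def export_to_hf(adapter, model) -> dict[str, Any]:
    """Convert a model's internal-format ('model.*') state dict to HF layout."""
    return adapter.to_hf(
        model.state_dict(),                    # internal-format keys
        exclude_key_regex=r".*_extra_state$",  # illustrative pattern only
    )
```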
- from_hf(
- hf_state_dict: dict[str, Any],
- device_mesh: Optional[torch.distributed.device_mesh.DeviceMesh] = None,
- **kwargs,
)
Convert HF checkpoint to internal format:
- Rename backbone → model
- Rename norm_f → norm
- Aggregate per-expert weights into grouped tensors
- If device_mesh is provided, only load experts needed for the current rank
- Parameters:
hf_state_dict – HuggingFace format state dict
device_mesh – Optional device mesh for distributed expert loading
**kwargs – Additional arguments
- Returns:
Internal format state dict
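A sketch of how from_hf might be called, assuming an adapter instance and an already-loaded HF state dict. Only the documented signature is relied upon; the key check simply reflects the internal 'model.' / lm_head layout described above.

```python
from typing import Any, Optional

from torch.distributed.device_mesh import DeviceMesh

def import_from_hf(
    adapter,
    hf_state_dict: dict[str, Any],
    device_mesh: Optional[DeviceMesh] = None,
) -> dict[str, Any]:
    """Rename backbone -> model and norm_f -> norm, grouping per-expert weights."""
    internal_sd = adapter.from_hf(hf_state_dict, device_mesh=device_mesh)
    # Per the mapping above, converted keys are expected under the 'model.' prefix
    # (plus lm_head.*), with grouped expert tensors replacing per-expert entries.
    assert all(k.startswith(("model.", "lm_head.")) for k in internal_sd)
    return internal_sd
```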
- convert_single_tensor_to_hf(
- fqn: str,
- tensor: Any,
- **kwargs,
)
Convert a single tensor from internal format to HuggingFace format.
- Parameters:
fqn – Fully qualified name of the tensor in internal format
tensor – The tensor to convert
**kwargs – Additional arguments for conversion
- Returns:
List of (fqn, tensor) tuples in HuggingFace format
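A brief sketch of how the per-tensor converter could be used, for example while streaming a checkpoint tensor by tensor. The example FQN is hypothetical; only the documented signature (fqn, tensor) and return type (list of (fqn, tensor) tuples) are relied upon.

```python
import torch

def convert_and_report(adapter, fqn: str, tensor: torch.Tensor):
    """Convert one internal-format tensor and print the resulting HF entries."""
    pairs = adapter.convert_single_tensor_to_hf(fqn, tensor)
    for hf_fqn, hf_tensor in pairs:            # a grouped expert tensor may fan out
        print(hf_fqn, tuple(hf_tensor.shape))  # into several per-expert HF entries
    return pairs

# For example (hypothetical FQN), a grouped tensor such as
# 'model.layers.0.mixer.experts.gate_and_up_projs' (shape [n_experts, dim, inter_dim])
# would be expected to map back to per-expert HF keys like
# 'backbone.layers.0.mixer.experts.{i}.up_proj.weight'.
```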