nemo_automodel.components.models.nemotron_omni.state_dict_adapter


State dict adapter for NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) models.

Converts between HuggingFace checkpoint format and the custom Automodel format.

HF checkpoint key structure (from model.safetensors.index.json):

Vision encoder (RADIO) — loaded as-is into self.vision_model

vision_model.radio_model.model.blocks.{N}.{…}
vision_model.radio_model.input_conditioner.norm_mean
vision_model.radio_model.input_conditioner.norm_std
vision_model.radio_model.model.patch_generator.{…}

Vision projector — loaded into self.vision_projector

HF: mlp1.0.weight (RMSNorm) -> Custom: vision_projector.norm.weight
HF: mlp1.1.weight (Linear1) -> Custom: vision_projector.linear1.weight
HF: mlp1.3.weight (Linear2) -> Custom: vision_projector.linear2.weight
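The vision projector mapping is a fixed key-for-key rename, so it can be sketched with two plain dictionaries. The three key pairs come from the listing above; the `rename_keys` helper is a hypothetical name used only for illustration and is not part of the module.

```python
# The three HF -> custom pairs documented for the vision projector.
VISION_PROJ_HF_TO_CUSTOM = {
    "mlp1.0.weight": "vision_projector.norm.weight",
    "mlp1.1.weight": "vision_projector.linear1.weight",
    "mlp1.3.weight": "vision_projector.linear2.weight",
}
# The reverse direction is the inverted dictionary.
VISION_PROJ_CUSTOM_TO_HF = {v: k for k, v in VISION_PROJ_HF_TO_CUSTOM.items()}


def rename_keys(state_dict, mapping):
    """Hypothetical helper: rename keys found in `mapping`, keep all others."""
    return {mapping.get(key, key): value for key, value in state_dict.items()}


hf_weights = {"mlp1.0.weight": "norm", "mlp1.3.weight": "linear2"}
custom_weights = rename_keys(hf_weights, VISION_PROJ_HF_TO_CUSTOM)
round_trip = rename_keys(custom_weights, VISION_PROJ_CUSTOM_TO_HF)
```

Because the mapping is one-to-one, applying the inverted dictionary restores the original HF keys.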

Sound encoder (Parakeet) — loaded into self.sound_encoder

HF: sound_encoder.encoder.{…} -> Custom: sound_encoder.{…}
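The sound encoder conversion amounts to dropping (or restoring) one `encoder.` path segment. A minimal sketch of that prefix rewrite follows; the function names and the example key suffix `layers.0.weight` are hypothetical and chosen only for illustration.

```python
HF_SOUND_PREFIX = "sound_encoder.encoder."
CUSTOM_SOUND_PREFIX = "sound_encoder."


def sound_key_from_hf(key: str) -> str:
    """Drop the extra `encoder.` segment from an HF sound encoder key."""
    if key.startswith(HF_SOUND_PREFIX):
        return CUSTOM_SOUND_PREFIX + key[len(HF_SOUND_PREFIX):]
    return key


def sound_key_to_hf(key: str) -> str:
    """Re-insert the `encoder.` segment on a custom sound encoder key."""
    if key.startswith(CUSTOM_SOUND_PREFIX):
        return HF_SOUND_PREFIX + key[len(CUSTOM_SOUND_PREFIX):]
    return key


# Hypothetical key suffix, used only to show the rewrite.
custom_key = sound_key_from_hf("sound_encoder.encoder.layers.0.weight")
hf_key = sound_key_to_hf(custom_key)
```

Keys under `sound_projection.` do not start with `sound_encoder.`, so both functions leave them unchanged.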

Sound projector — loaded into self.sound_projection

HF: sound_projection.norm.weight -> Custom: sound_projection.norm.weight
HF: sound_projection.linear1.weight -> Custom: sound_projection.linear1.weight
HF: sound_projection.linear2.weight -> Custom: sound_projection.linear2.weight

LLM (NemotronH) — uses nemotron_v3 state_dict_adapter internally

HF: language_model.backbone.embeddings.weight -> Custom: language_model.model.embed_tokens.weight
HF: language_model.backbone.layers.{N}.{…} -> Custom: language_model.model.layers.{N}.{…}
HF: language_model.backbone.norm_f.weight -> Custom: language_model.model.norm.weight
HF: language_model.lm_head.weight -> Custom: language_model.lm_head.weight

For MoE layers in the LLM:

HF: language_model.backbone.layers.{N}.mixer.experts.{E}.up_proj.weight (split per-expert) -> Custom: language_model.model.layers.{N}.mixer.experts.gate_and_up_projs (merged)
HF: language_model.backbone.layers.{N}.mixer.experts.{E}.down_proj.weight (split per-expert) -> Custom: language_model.model.layers.{N}.mixer.experts.down_projs (merged)
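Before per-expert weights can be merged, the HF keys have to be grouped by layer and projection and ordered by expert index. The sketch below covers only that grouping step with the standard library; the function name is hypothetical, and how the grouped tensors are then combined (stacking order, transposition, sharding) is not specified here and is left out.

```python
import re
from collections import defaultdict

# Matches the per-expert HF keys listed above and captures layer, expert, projection.
_EXPERT_KEY = re.compile(
    r"^language_model\.backbone\.layers\.(\d+)\.mixer\.experts\.(\d+)"
    r"\.(up_proj|down_proj)\.weight$"
)


def group_expert_keys(hf_keys):
    """Hypothetical helper: map (layer, projection) -> HF keys sorted by expert index."""
    grouped = defaultdict(dict)
    for key in hf_keys:
        match = _EXPERT_KEY.match(key)
        if match is None:
            continue  # not a per-expert weight
        layer, expert, proj = int(match.group(1)), int(match.group(2)), match.group(3)
        grouped[(layer, proj)][expert] = key
    # Sort numerically so that expert 10 comes after expert 2.
    return {
        layer_proj: [by_expert[e] for e in sorted(by_expert)]
        for layer_proj, by_expert in grouped.items()
    }


prefix = "language_model.backbone.layers.1.mixer.experts."
keys = [f"{prefix}{e}.up_proj.weight" for e in (10, 0, 2)]
keys.append("language_model.backbone.norm_f.weight")
groups = group_expert_keys(keys)
```

Parsing the expert index as an integer matters: a plain string sort would place `experts.10` before `experts.2` and scramble the merged order.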

Module Contents

Classes

| Name | Description |
| --- | --- |
| NemotronOmniStateDictAdapter | State dict adapter for NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) models. |

Data

_VISION_PROJ_CUSTOM_TO_HF

_VISION_PROJ_HF_TO_CUSTOM

logger

API

class nemo_automodel.components.models.nemotron_omni.state_dict_adapter.NemotronOmniStateDictAdapter(
config,
llm_config,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
dtype: torch.dtype = torch.bfloat16
)

Bases: StateDictAdapter

State dict adapter for NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) models.

Handles conversion between HF checkpoint format and custom Automodel format.

The adapter delegates LLM key conversion to NemotronV3StateDictAdapter (which handles backbone->model renaming, norm_f->norm, embeddings->embed_tokens, and MoE expert merging) and handles vision/audio components directly.

_llm_adapter

nemo_automodel.components.models.nemotron_omni.state_dict_adapter.NemotronOmniStateDictAdapter.convert_single_tensor_to_hf(
fqn: str,
tensor: typing.Any,
**kwargs
) -> list[tuple[str, typing.Any]]

Convert a single tensor from custom format to HF format.

Parameters:

fqn
str

Fully qualified name of the tensor

tensor
Any

The tensor to convert

**kwargs
Defaults to {}

Additional arguments

Returns: list[tuple[str, Any]]

List of (fqn, tensor) tuples in HF format

nemo_automodel.components.models.nemotron_omni.state_dict_adapter.NemotronOmniStateDictAdapter.from_hf(
hf_state_dict: dict[str, typing.Any],
device_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None,
**kwargs
) -> dict[str, typing.Any]

Convert HF checkpoint state dict to custom Automodel format.

Steps:

  1. Separate HF state dict into: vision_model, mlp1, sound_encoder, sound_projection, language_model components
  2. Convert vision projector keys (mlp1.* -> vision_projector.*)
  3. Convert sound encoder keys (sound_encoder.encoder.* -> sound_encoder.*)
  4. Pass language_model keys through NemotronV3StateDictAdapter
  5. Merge everything back
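The split, convert, and merge flow in the steps above can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the adapter's code: `split_by_component` and `convert_and_merge` are hypothetical names, and the per-component converters are passed in as callables so the sketch stays independent of the real adapters.

```python
# Top-level key prefixes of the five components named in step 1.
COMPONENT_PREFIXES = (
    "vision_model.",
    "mlp1.",
    "sound_encoder.",
    "sound_projection.",
    "language_model.",
)


def split_by_component(state_dict):
    """Partition a state dict by top-level prefix; unmatched keys go to 'other'."""
    parts = {prefix: {} for prefix in COMPONENT_PREFIXES}
    parts["other"] = {}
    for key, value in state_dict.items():
        for prefix in COMPONENT_PREFIXES:
            if key.startswith(prefix):
                parts[prefix][key] = value
                break
        else:
            parts["other"][key] = value
    return parts


def convert_and_merge(state_dict, converters):
    """Apply a converter per component (identity if none given) and merge the results."""
    merged = {}
    for prefix, sub_dict in split_by_component(state_dict).items():
        convert = converters.get(prefix, lambda d: d)
        merged.update(convert(sub_dict))
    return merged


hf_sd = {
    "mlp1.0.weight": "a",
    "vision_model.radio_model.input_conditioner.norm_mean": "b",
    "language_model.lm_head.weight": "c",
}
# Stand-in converter for the vision projector; the LLM part would go through its own adapter.
converters = {
    "mlp1.": lambda d: {k.replace("mlp1.0", "vision_projector.norm"): v for k, v in d.items()}
}
custom_sd = convert_and_merge(hf_sd, converters)
```

Components without a converter, such as the vision encoder here, pass through unchanged, which mirrors the "loaded as-is" behavior described for `vision_model`.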

Parameters:

hf_state_dict
dict[str, Any]

HuggingFace format state dict

device_mesh
Optional[DeviceMesh]
Defaults to None

Optional device mesh for distributed expert loading

**kwargs
Defaults to {}

Additional arguments

Returns: dict[str, Any]

Custom format state dict

nemo_automodel.components.models.nemotron_omni.state_dict_adapter.NemotronOmniStateDictAdapter.to_hf(
state_dict: dict[str, typing.Any],
exclude_key_regex: typing.Optional[str] = None,
**kwargs
) -> dict[str, typing.Any]

Convert custom Automodel state dict to HF format.

Steps:

  1. Separate state dict into components
  2. Convert vision projector keys back (vision_projector.* -> mlp1.*)
  3. Convert sound encoder keys back (sound_encoder.* -> sound_encoder.encoder.*)
  4. Pass LLM keys through NemotronV3StateDictAdapter.to_hf
  5. Merge everything back
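The `exclude_key_regex` parameter can be thought of as a filter over keys. The sketch below is a guess at the behavior, not the adapter's implementation: it assumes the pattern is matched from the start of each key with `re.match`, and the helper name `drop_excluded_keys` is hypothetical.

```python
import re
from typing import Any, Optional


def drop_excluded_keys(
    state_dict: dict[str, Any],
    exclude_key_regex: Optional[str] = None,
) -> dict[str, Any]:
    """Return a copy of `state_dict` without keys matching the pattern."""
    if exclude_key_regex is None:
        return dict(state_dict)
    pattern = re.compile(exclude_key_regex)
    # Assumption: the pattern is anchored at the start of the key (re.match).
    return {k: v for k, v in state_dict.items() if pattern.match(k) is None}


hf_sd = {"language_model.lm_head.weight": 1, "mlp1.0.weight": 2}
filtered = drop_excluded_keys(hf_sd, r"mlp1\..*")
unfiltered = drop_excluded_keys(hf_sd)
```

If the real adapter uses `re.search` or `re.fullmatch` instead, unanchored or partial patterns would behave differently, so it is worth checking against the source before relying on a particular pattern.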

Parameters:

state_dict
dict[str, Any]

Custom format state dict

exclude_key_regex
Optional[str]
Defaults to None

Optional regex pattern to exclude keys

**kwargs
Defaults to {}

Additional arguments

Returns: dict[str, Any]

HuggingFace format state dict

nemo_automodel.components.models.nemotron_omni.state_dict_adapter._VISION_PROJ_CUSTOM_TO_HF = {v: k for k, v in (_VISION_PROJ_HF_TO_CUSTOM.items())}
nemo_automodel.components.models.nemotron_omni.state_dict_adapter._VISION_PROJ_HF_TO_CUSTOM = {'mlp1.0.weight': 'vision_projector.norm.weight', 'mlp1.1.weight': 'vision_proje...
nemo_automodel.components.models.nemotron_omni.state_dict_adapter.logger = logging.getLogger(__name__)