nemo_automodel.components.models.nemotron_omni.model

NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) custom model for Nemo Automodel.

This model is a VLM (vision-language model) with:

Vision encoder: RADIO v2.5-H (ViT-Huge, patch_size=16) — loaded from HF
Audio encoder: Parakeet (FastConformer-based) — loaded from HF
LLM: NemotronH (hybrid Mamba+Attention MoE) — reuses nemotron_v3 custom implementation
Projectors: MLP projectors for vision->LLM and audio->LLM

Architecture name: “NemotronH_Nano_Omni_Reasoning_V3” (from config.json)

Module Contents

Classes

Name	Description
`NemotronOmniConfig`	Configuration for the NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) model.
`NemotronOmniForConditionalGeneration`	NemotronOmni VLM model for conditional generation (training).
`RMSNorm`	Root Mean Square Layer Normalization.
`SoundProjection`	MLP projector from sound encoder to LLM hidden space.
`SquaredReLU`	Squared ReLU activation: ReLU(x)^2.
`VisionProjector`	MLP projector from vision encoder to LLM hidden space.
`_ModelProxy`	Thin proxy so the MoE parallelizer can navigate model.model.moe_config and model.model -> get_text_module -> .layers without changing the weight hierarchy.

Data

ModelClass

logger

API

class nemo_automodel.components.models.nemotron_omni.model.NemotronOmniConfig(
    vision_config = None,
    llm_config = None,
    sound_config = None,
    force_image_size = 512,
    downsample_ratio = 0.5,
    patch_size = 16,
    template = None,
    ps_version = 'v2',
    image_tag_type = 'internvl',
    projector_hidden_size = 20480,
    vit_hidden_size = 1280,
    img_context_token_id = 18,
    video_context_token_id = 131081,
    sound_context_token_id = 27,
    video_pruning_rate = 0.7,
    kwargs = {}
)

Bases: PretrainedConfig

Configuration for the NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) model.

This wraps the HF config and provides easy access to sub-configs.

model_type

= 'NemotronH_Nano_Omni_Reasoning_V3'

class nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)

Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin

NemotronOmni VLM model for conditional generation (training).

Wraps:

Vision encoder (RADIO v2.5-H) — HF implementation via trust_remote_code
Audio encoder (Parakeet) — HF implementation via trust_remote_code
Vision projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)
Sound projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)
Language model (NemotronH hybrid Mamba+Attention MoE) — nemotron_v3 custom impl

The LLM part reuses the nemotron_v3 implementation (NemotronHForCausalLM) which has custom DTensor parallelism for the Mamba+Attention hybrid MoE architecture.

backend

= backend or BackendConfig()

downsample_ratio

= getattr(config, 'downsample_ratio', 0.5)

force_image_size

= getattr(config, 'force_image_size', 512)

img_context_token_id

= getattr(config, 'img_context_token_id', 18)

language_model

model

= _ModelProxy(self.language_model)

num_image_token

patch_size

= getattr(config, 'patch_size', 16)

ps_version

= getattr(config, 'ps_version', 'v2')

sound_context_token_id

= getattr(config, 'sound_context_token_id', 27)

sound_encoder

= ParakeetEncoder(parakeet_config).to(dtype)

sound_projection

state_dict_adapter

video_context_token_id

= getattr(config, 'video_context_token_id', 131081)

video_temporal_patch_dim

= getattr(config, 'video_temporal_patch_size', None)

vision_model

= self.vision_model.to(dtype)

vision_projector

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration._make_missing_buffers_non_persistent(
    module: torch.nn.Module
) -> None

staticmethod

Convert persistent buffers that are NOT saved in HF checkpoints to non-persistent buffers.

The RADIO vision encoder registers some buffers (e.g. summary_idxs) as persistent, but the HF checkpoint does not contain them. When the DCP loader builds its load plan it expects every persistent buffer to appear in the checkpoint and raises RuntimeError: Missing key otherwise.

This method re-registers such buffers as non-persistent so they are kept at their init-time values and not expected on disk.
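The re-registration step can be sketched as follows. This is a minimal illustration, assuming the set of buffer names absent from the checkpoint is already known; the helper name `demote_buffers` and its `missing` argument are hypothetical, and the real method decides which buffers to convert on its own.

```python
import torch
from torch import nn


def demote_buffers(module: nn.Module, missing: set[str]) -> None:
    """Hypothetical helper: re-register the named buffers as non-persistent."""
    for sub in module.modules():
        # Snapshot the buffers first because register_buffer mutates the dict.
        for name, buf in list(sub.named_buffers(recurse=False)):
            if name in missing:
                # Same tensor, same name, but excluded from state_dict().
                sub.register_buffer(name, buf, persistent=False)


class Toy(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.register_buffer("summary_idxs", torch.tensor([0, 1, 2]))
        self.register_buffer("kept", torch.zeros(1))


toy = Toy()
demote_buffers(toy, {"summary_idxs"})
print(sorted(toy.state_dict().keys()))
```

After the call the buffer is still available as `toy.summary_idxs` with its init-time value, but it no longer appears in `state_dict()`, so a loader that plans from the state dict would not look for it on disk.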

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration._pixel_shuffle_dynamic_res(
    x: torch.Tensor,
    imgs_sizes: list[tuple[int, int]]
) -> torch.Tensor

Per-image pixel-shuffle for dynamic-resolution outputs.

Ported from vLLM’s NanoNemotronVLMultimodal.pixel_shuffle_dynamic_res. Splits x along the sequence dim by per-image patch counts, reshapes each split to (N, H_patches, W_patches, C_feat), applies pixel_shuffle with downsample_ratio, and flattens back to a concatenated (N, L’, C).
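The split, reshape, and flatten steps can be sketched as below. This is an illustration under stated assumptions, not the ported code: it assumes `x` has shape (1, total_patches, C), that `imgs_sizes` holds pixel sizes divisible by `patch_size`, and it uses a plain space-to-depth reshape whose channel ordering may differ from the real `pixel_shuffle`. The names `shuffle_grid` and `pixel_shuffle_dynamic` are hypothetical.

```python
import torch


def shuffle_grid(grid: torch.Tensor, r: int) -> torch.Tensor:
    """Space-to-depth on one (gh, gw, C) grid: fold each r x r block into channels."""
    gh, gw, c = grid.shape
    grid = grid.reshape(gh // r, r, gw // r, r, c)
    grid = grid.permute(0, 2, 1, 3, 4)  # (gh/r, gw/r, r, r, C)
    return grid.reshape((gh // r) * (gw // r), r * r * c)


def pixel_shuffle_dynamic(x, imgs_sizes, patch_size=16, downsample_ratio=0.5):
    r = round(1 / downsample_ratio)
    # Patch grid of every image, then the number of patches each one owns.
    grids = [(h // patch_size, w // patch_size) for h, w in imgs_sizes]
    chunks = torch.split(x, [gh * gw for gh, gw in grids], dim=1)
    out = [
        shuffle_grid(chunk[0].reshape(gh, gw, -1), r)
        for chunk, (gh, gw) in zip(chunks, grids)
    ]
    # Concatenate per-image results back into one (1, L', C') sequence.
    return torch.cat(out, dim=0).unsqueeze(0)


x = torch.randn(1, 2 * 4 + 4 * 4, 8)  # two images: 2x4 and 4x4 patches
y = pixel_shuffle_dynamic(x, [(32, 64), (64, 64)])
print(y.shape)
```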

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.extract_feature(
    pixel_values: torch.Tensor
) -> torch.Tensor

Extract vision features from pixel values through RADIO + projector.

Parameters:

pixel_values

torch.Tensor

Image tensors [num_tiles, C, H, W]

Returns: torch.Tensor

Vision embeddings [num_tiles, num_tokens, llm_hidden_size]
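For the default configuration, the per-tile token count can be estimated with a short calculation. This sketch assumes num_tokens follows the InternVL-style formula (patches per side squared, scaled by downsample_ratio squared); both the formula and the helper name `tokens_per_tile` are assumptions rather than something stated in this module.

```python
def tokens_per_tile(
    image_size: int = 512,          # force_image_size default
    patch_size: int = 16,           # patch_size default
    downsample_ratio: float = 0.5,  # downsample_ratio default
) -> int:
    """Assumed InternVL-style count of vision tokens produced per tile."""
    patches_per_side = image_size // patch_size
    return int(patches_per_side**2 * downsample_ratio**2)


n_tokens = tokens_per_tile()
print(n_tokens)  # 256
```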

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.extract_feature_dynamic(
    pixel_values: torch.Tensor,
    imgs_sizes: torch.Tensor | list[tuple[int, int]]
) -> torch.Tensor

Dynamic-resolution feature extraction (no tile splitting).

Matches vLLM’s dynamic-resolution vision path for Nano v3 VL / Nemotron-Omni (see 3rdparty/vllm/vllm/model_executor/models/nano_nemotron_vl.py). Required when the rollout uses DynamicResolutionImageTiler: the tile-based extract_feature would produce different embeddings and break rollout/train logprob agreement.

Unlike vLLM’s RADIO port (which supports packed imgs_sizes= inputs), the HF RADIO from nvidia/C-RADIOv2-H only accepts a dense (B, C, H, W) tensor. We crop each padded image back to its real size and run the vision model per-image, then concatenate features.
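The crop-and-encode loop described above might look like the following. It is a sketch that assumes padding sits at the bottom and right of each image, so the real content is the top-left (h, w) region; `encode_per_image` and `toy_encoder` are hypothetical names, and `toy_encoder` merely stands in for the RADIO vision model.

```python
import torch
import torch.nn.functional as F


def toy_encoder(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Stand-in encoder: one flattened vector per patch, shape (1, L, C*patch*patch)."""
    return F.unfold(img, kernel_size=patch, stride=patch).transpose(1, 2)


def encode_per_image(pixel_values, imgs_sizes, encoder=toy_encoder):
    if isinstance(imgs_sizes, torch.Tensor):
        imgs_sizes = imgs_sizes.tolist()
    feats = []
    for i, (h, w) in enumerate(imgs_sizes):
        # Crop the padded image back to its real size (assumes top-left alignment).
        crop = pixel_values[i : i + 1, :, :h, :w]
        feats.append(encoder(crop))
    # Concatenate along the sequence dimension.
    return torch.cat(feats, dim=1)


batch = torch.randn(2, 3, 64, 64)  # padded to the batch max (64, 64)
feats = encode_per_image(batch, torch.tensor([[32, 64], [64, 64]]))
print(feats.shape)
```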

Parameters:

pixel_values

torch.Tensor

[num_images, C, H_padded, W_padded] batch of dynamically-resized images padded to the batch max (h, w).

imgs_sizes

torch.Tensor | list[tuple[int, int]]

[num_images, 2] actual (h, w) per image (torch tensor of ints) or an equivalent list of tuples.

Returns: torch.Tensor

Vision embeddings [sum_num_embeddings_after_pixel_shuffle,

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.extract_sound_feature(
    input_features: torch.Tensor,
    attention_mask: typing.Optional[torch.Tensor] = None
) -> torch.Tensor

Extract and project sound features from audio input.

Parameters:

input_features

torch.Tensor

Mel spectrogram features [batch, seq_len, feature_dim]

attention_mask

Optional[torch.Tensor], defaults to None

Optional attention mask [batch, seq_len]

Returns: torch.Tensor

Sound embeddings projected to LLM hidden size

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.extract_video_feature(
    pixel_values_videos: torch.Tensor
) -> torch.Tensor

Pack T = video_temporal_patch_dim frames into channels and run the ViT.

Returns embeddings shaped like extract_feature output, but with ceil(N_frames / T) rows instead of one row per frame.
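The frame-packing step can be illustrated as below. This sketch assumes frames arrive as (N_frames, C, H, W) and that an incomplete final group is padded by repeating the last frame; the padding strategy and the helper name `pack_frames` are assumptions.

```python
import math

import torch


def pack_frames(frames: torch.Tensor, t: int) -> torch.Tensor:
    """Stack every t consecutive frames along the channel dimension."""
    n, c, h, w = frames.shape
    pad = (-n) % t
    if pad:
        # Assumption: fill the last group by repeating the final frame.
        frames = torch.cat([frames, frames[-1:].expand(pad, -1, -1, -1)], dim=0)
    return frames.reshape(-1, t * c, h, w)


frames = torch.randn(5, 3, 8, 8)
packed = pack_frames(frames, t=2)
print(packed.shape, math.ceil(5 / 2))
```

With five frames and T = 2 the packed tensor has ceil(5 / 2) = 3 rows, each carrying 2 * 3 = 6 channels.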

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.forward(
    pixel_values: typing.Optional[torch.FloatTensor] = None,
    input_ids: typing.Optional[torch.LongTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    image_flags: typing.Optional[torch.LongTensor] = None,
    imgs_sizes: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None,
    labels: typing.Optional[torch.LongTensor] = None,
    sound_features: typing.Optional[torch.FloatTensor] = None,
    sound_attention_mask: typing.Optional[torch.Tensor] = None,
    pixel_values_videos: typing.Optional[torch.FloatTensor] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    use_cache: typing.Optional[bool] = None,
    output_attentions: typing.Optional[bool] = None,
    output_hidden_states: typing.Optional[bool] = None,
    return_dict: typing.Optional[bool] = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    _pre_embed_only: bool = False,
    kwargs = {}
) -> typing.Union[dict, typing.Tuple, transformers.modeling_outputs.CausalLMOutputWithPast]

Forward pass for training.

This follows the same pattern as the HF NemotronH_Nano_Omni_Reasoning_V3.forward():

Get text embeddings from LLM embed_tokens
Extract vision features from pixel_values
Replace image token embeddings with vision embeddings
Run LLM forward pass
Compute loss if labels provided
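The embedding-replacement step in the list above can be sketched as a masked assignment. This is an illustration only: `merge_vision_embeds` is a hypothetical helper, the token ID uses the documented default of 18, and the handling of image_flags is assumed to mean "drop tiles whose flag is not 1".

```python
import torch

IMG_CONTEXT_TOKEN_ID = 18  # documented default for img_context_token_id


def merge_vision_embeds(input_ids, text_embeds, vit_embeds, image_flags=None):
    if image_flags is not None:
        # Assumption: keep only real tiles, discard padding tiles.
        vit_embeds = vit_embeds[image_flags.squeeze(-1) == 1]
    b, s, c = text_embeds.shape
    flat = text_embeds.reshape(b * s, c).clone()
    mask = input_ids.reshape(b * s) == IMG_CONTEXT_TOKEN_ID
    # Overwrite image-placeholder positions with the vision embeddings.
    flat[mask] = vit_embeds.reshape(-1, c).to(flat.dtype)
    return flat.reshape(b, s, c)


ids = torch.tensor([[1, 18, 18, 2]])
text = torch.zeros(1, 4, 3)
vit = torch.ones(3, 1, 3)                # three tiles, one token each
flags = torch.tensor([[1], [0], [1]])    # the middle tile is padding
merged = merge_vision_embeds(ids, text, vit, flags)
print(merged[0, :, 0])
```

The number of placeholder tokens in input_ids must equal the number of vision embedding rows left after filtering, otherwise the assignment fails with a shape error.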

Parameters:

pixel_values

Optional[torch.FloatTensor], defaults to None

Image pixel values [num_tiles, C, H, W]

input_ids

Optional[torch.LongTensor], defaults to None

Input token IDs [batch, seq_len]

attention_mask

Optional[torch.Tensor], defaults to None

Attention mask [batch, seq_len]

position_ids

Optional[torch.LongTensor], defaults to None

Position IDs (unused, for API compat)

image_flags

Optional[torch.LongTensor], defaults to None

Flags indicating real images vs padding [num_tiles, 1]

labels

Optional[torch.LongTensor], defaults to None

Token IDs for loss computation [batch, seq_len]

inputs_embeds

Optional[torch.FloatTensor], defaults to None

Pre-computed input embeddings (optional)

use_cache

Optional[bool], defaults to None

Whether to use caching (not used in training)

output_hidden_states

Optional[bool], defaults to None

Whether the returned output should carry the final decoder hidden states (required for fused linear cross-entropy / cut-CE). Defaults to the text sub-config’s output_hidden_states when None.

logits_to_keep

Union[int, torch.Tensor], defaults to 0

If 0 (default), compute logits for all positions; if > 0, only compute logits for the last logits_to_keep positions (used by fused linear cross-entropy to avoid the full logit matrix). Forwarded to the language-model lm_head gating.

**kwargs

Defaults to {}

Additional arguments

Returns: Union[dict, Tuple, CausalLMOutputWithPast]

CausalLMOutputWithPast with loss and logits
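The logits_to_keep behaviour described in the parameter list can be sketched with the slicing idiom common in Hugging Face causal LMs. Whether this model uses exactly this idiom is an assumption, and `select_positions` is a hypothetical helper.

```python
import torch


def select_positions(hidden_states: torch.Tensor, logits_to_keep=0) -> torch.Tensor:
    """Pick the sequence positions that would be passed to lm_head."""
    if isinstance(logits_to_keep, int):
        # slice(-0, None) equals slice(0, None), so 0 keeps every position.
        idx = slice(-logits_to_keep, None)
    else:
        idx = logits_to_keep  # explicit tensor of positions
    return hidden_states[:, idx, :]


h = torch.arange(2 * 5 * 3, dtype=torch.float32).reshape(2, 5, 3)
print(select_positions(h).shape)      # all 5 positions
print(select_positions(h, 2).shape)   # last 2 positions
```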

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.from_config(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)

classmethod

Create model from config.

Parameters:

config

NemotronH_Nano_Omni_Reasoning_V3 config (HF config with trust_remote_code)

backend

BackendConfig | None, defaults to None

Backend configuration

**kwargs

Defaults to {}

Additional arguments

Returns:

NemotronOmniForConditionalGeneration instance

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path: str,
    model_args = (),
    kwargs = {}
)

classmethod

Load pretrained model.

Parameters:

pretrained_model_name_or_path

str

Path or name of pretrained model

*model_args

Defaults to ()

Additional positional arguments

**kwargs

Defaults to {}

Additional keyword arguments

Returns:

NemotronOmniForConditionalGeneration instance

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.get_input_embeddings()

Return the input embeddings from the language model.

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.get_output_embeddings()

Return the output embeddings (lm_head) from the language model.

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None

Initialize model weights.

Parameters:

buffer_device

torch.device | None, defaults to None

Device to use for buffer initialization

dtype

torch.dtype, defaults to torch.bfloat16

Target dtype for model weights

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.pixel_shuffle(
    x: torch.Tensor,
    scale_factor: float = 0.5
) -> torch.Tensor

Pixel shuffle for downsampling spatial resolution while increasing channels.

Parameters:

x

torch.Tensor

Input tensor [N, W, H, C]

scale_factor

float, defaults to 0.5

Downsampling ratio (default 0.5 = halve spatial dims)

Returns: torch.Tensor

Shuffled tensor [N, W*scale, H*scale, C/(scale^2)]
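The shape change can be checked with a short sketch modeled on the InternVL-style pixel shuffle, which the `ps_version = 'v2'` and `image_tag_type = 'internvl'` defaults suggest. Treat the body as an assumption about the implementation; only the input and output shapes come from the description above.

```python
import torch


def pixel_shuffle(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    n, w, h, c = x.shape
    # (N, W, H, C) -> (N, W, H*scale, C/scale)
    x = x.reshape(n, w, int(h * scale_factor), int(c / scale_factor))
    # -> (N, H*scale, W, C/scale)
    x = x.permute(0, 2, 1, 3).contiguous()
    # -> (N, H*scale, W*scale, C/scale^2)
    x = x.reshape(n, int(h * scale_factor), int(w * scale_factor), int(c / scale_factor**2))
    # Assumed 'v2' behaviour: swap back to (N, W*scale, H*scale, C/scale^2).
    return x.permute(0, 2, 1, 3).contiguous()


x = torch.randn(1, 32, 32, 1280)  # 512 / 16 = 32 patches per side, vit_hidden_size = 1280
y = pixel_shuffle(x)
print(y.shape)
```

With the default scale of 0.5, each spatial dimension is halved and the channel count grows by a factor of four, so the total number of elements is unchanged.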

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.prepare_inputs_embeds_for_cp(
    input_ids: torch.Tensor,
    pixel_values: typing.Optional[torch.Tensor] = None,
    image_flags: typing.Optional[torch.Tensor] = None,
    imgs_sizes: typing.Optional[torch.Tensor] = None,
    pixel_values_videos: typing.Optional[torch.Tensor] = None,
    sound_features: typing.Optional[torch.Tensor] = None,
    sound_attention_mask: typing.Optional[torch.Tensor] = None
) -> torch.Tensor

Thin wrapper returning just inputs_embeds for callers that don’t need the full prepared-inputs dict.

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.prepare_model_inputs_for_cp(
    input_ids: torch.Tensor,
    pixel_values: typing.Optional[torch.Tensor] = None,
    image_flags: typing.Optional[torch.Tensor] = None,
    imgs_sizes: typing.Optional[torch.Tensor] = None,
    pixel_values_videos: typing.Optional[torch.Tensor] = None,
    sound_features: typing.Optional[torch.Tensor] = None,
    sound_attention_mask: typing.Optional[torch.Tensor] = None
) -> dict

Merge image/video/audio features into text embeddings BEFORE CP sharding.

Under CP > 1 the sequence is sharded; multimodal scatter must run on the full un-sharded sequence so each rank ends up with embeddings that match its local slice of input_ids. Returns a dict so future per-layer inputs can ride alongside inputs_embeds.
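A toy example shows why the merge has to precede sharding. The sketch assumes contiguous, equal-sized shards along the sequence dimension (real context-parallel layouts may reorder tokens for load balancing) and uses made-up embedding values.

```python
import torch

IMG = 18  # documented default for img_context_token_id
input_ids = torch.tensor([[1, IMG, IMG, IMG, IMG, 2]])
embeds = torch.zeros(1, 6, 2)

# Four vision embedding rows with values 1..4, scattered on the FULL sequence.
vision = torch.arange(1, 5, dtype=torch.float32).unsqueeze(-1).expand(4, 2)
embeds[input_ids == IMG] = vision

# Only now split into CP shards (assumption: contiguous halves, CP size 2).
embed_shards = torch.chunk(embeds, 2, dim=1)
id_shards = torch.chunk(input_ids, 2, dim=1)

for rank in range(2):
    print(rank, id_shards[rank].tolist(), embed_shards[rank][0, :, 0].tolist())
```

Rank 1 receives vision rows 3 and 4 at its first two local positions. Had each rank scattered independently on its own shard, it would have no way of knowing that its image tokens start at the third vision row.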

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.set_input_embeddings(
    value
)

Set the input embeddings of the language model.

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.set_output_embeddings(
    new_embeddings
)

Set the output embeddings (lm_head) of the language model.

class nemo_automodel.components.models.nemotron_omni.model.RMSNorm(
    hidden_size: int,
    eps: float = 1e-05
)

Bases: Module

Root Mean Square Layer Normalization.

weight

= nn.Parameter(torch.ones(hidden_size))

nemo_automodel.components.models.nemotron_omni.model.RMSNorm.forward(
    hidden_states: torch.Tensor
) -> torch.Tensor
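A minimal sketch of what such a forward pass typically computes, assuming the standard RMSNorm formulation (scale by the reciprocal root mean square over the last dimension, then by a learned weight). The float32 upcast and the class name `RMSNormSketch` are assumptions, not details confirmed by this page.

```python
import torch
from torch import nn


class RMSNormSketch(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-5) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        in_dtype = hidden_states.dtype
        x = hidden_states.float()  # assumption: normalize in float32
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return (self.weight * x).to(in_dtype)


norm = RMSNormSketch(8)
out = norm(torch.randn(4, 8))
print(out.pow(2).mean(-1))  # each entry is close to 1
```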

class nemo_automodel.components.models.nemotron_omni.model.SoundProjection(
    sound_hidden_size: int,
    projection_hidden_size: int,
    llm_hidden_size: int,
    bias: bool = False
)

Bases: Module

MLP projector from sound encoder to LLM hidden space.

activation

= SquaredReLU()

linear1

linear2

norm

= RMSNorm(sound_hidden_size, eps=1e-05)

nemo_automodel.components.models.nemotron_omni.model.SoundProjection.forward(
    x: torch.Tensor
) -> torch.Tensor

class nemo_automodel.components.models.nemotron_omni.model.SquaredReLU()

Bases: Module

Squared ReLU activation: ReLU(x)^2.

nemo_automodel.components.models.nemotron_omni.model.SquaredReLU.forward(
    x: torch.Tensor
) -> torch.Tensor
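The activation is small enough to show in full. A functional sketch (the function name `squared_relu` is illustrative):

```python
import torch
import torch.nn.functional as F


def squared_relu(x: torch.Tensor) -> torch.Tensor:
    """ReLU(x)^2: zero for negative inputs, x squared otherwise."""
    return F.relu(x).square()


vals = squared_relu(torch.tensor([-2.0, 0.0, 3.0]))
print(vals)  # tensor([0., 0., 9.])
```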

class nemo_automodel.components.models.nemotron_omni.model.VisionProjector(
    vit_hidden_size: int,
    projector_hidden_size: int,
    llm_hidden_size: int,
    downsample_ratio: float = 0.5
)

Bases: Module

MLP projector from vision encoder to LLM hidden space.

HF checkpoint structure (mlp1):

mlp1.0.weight -> RMSNorm weight (vit_hidden_size * pixel_shuffle_factor^2,)
mlp1.1.weight -> Linear1 weight (projector_hidden_size, vit_hidden_size * pixel_shuffle_factor^2)
mlp1.3.weight -> Linear2 weight (llm_hidden_size, projector_hidden_size)

Between linear1 and linear2 there is a SquaredReLU activation (index 2 in Sequential, but it has no weight).
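The expected weight shapes can be worked out from the constructor arguments. This sketch assumes pixel_shuffle_factor equals 1 / downsample_ratio; the helper name `projector_shapes` and the `llm_hidden_size` value of 4096 are placeholders for illustration, not values from the checkpoint.

```python
def projector_shapes(
    vit_hidden_size: int,
    projector_hidden_size: int,
    llm_hidden_size: int,
    downsample_ratio: float = 0.5,
) -> dict[str, tuple[int, ...]]:
    """Expected shapes of the three weighted entries in the mlp1 Sequential."""
    factor = round(1 / downsample_ratio)  # assumed pixel_shuffle_factor
    channels = vit_hidden_size * factor**2
    return {
        "mlp1.0.weight": (channels,),                        # RMSNorm
        "mlp1.1.weight": (projector_hidden_size, channels),  # Linear1
        "mlp1.3.weight": (llm_hidden_size, projector_hidden_size),  # Linear2
    }


# Config defaults for the first two arguments; llm_hidden_size is a placeholder.
shapes = projector_shapes(1280, 20480, llm_hidden_size=4096)
for key, shape in shapes.items():
    print(key, shape)
```

With the documented defaults, the normalized input width is 1280 * 4 = 5120 channels per token.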

activation

= SquaredReLU()

linear1

linear2

norm

= RMSNorm(pixel_shuffle_channels, eps=1e-05)

nemo_automodel.components.models.nemotron_omni.model.VisionProjector.forward(
    x: torch.Tensor
) -> torch.Tensor

class nemo_automodel.components.models.nemotron_omni.model._ModelProxy(
    llm: nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM
)

Thin proxy so the MoE parallelizer can navigate model.model.moe_config and model.model -> get_text_module -> .layers without changing the weight hierarchy.

The parallelizer (parallelizer.py) expects:

model.model.moe_config (for expert-count validation)
model.model -> get_text_module() (finds language_model attr) -> .layers

By setting self.model = _ModelProxy(self.language_model) on the VLM:

model.model.moe_config -> language_model.model.moe_config (OK)
get_text_module(model.model) -> model.model.language_model == language_model.model (NemotronV3Model) -> .layers (OK)
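The attribute paths can be traced with plain stand-in objects. This is a sketch of the navigation only: `ModelProxySketch` is a hypothetical plain class (the real proxy may be an nn.Module), and the `moe_config` contents and layer names are placeholders.

```python
from types import SimpleNamespace


class ModelProxySketch:
    """Exposes the two attributes the parallelizer is described as needing."""

    def __init__(self, llm) -> None:
        self.language_model = llm.model
        self.moe_config = llm.model.moe_config


# Stand-ins for NemotronHForCausalLM (llm) and its inner model (inner).
inner = SimpleNamespace(moe_config={"placeholder": True}, layers=["layer0", "layer1"])
llm = SimpleNamespace(model=inner)

# The VLM keeps its weights under language_model and adds the proxy as .model.
vlm = SimpleNamespace(language_model=llm)
vlm.model = ModelProxySketch(vlm.language_model)

print(vlm.model.moe_config is vlm.language_model.model.moe_config)
print(vlm.model.language_model.layers)
```

Both paths resolve to the same underlying objects, so the proxy adds a navigation route without creating a second copy of the layers.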

language_model

= llm.model

moe_config

= llm.model.moe_config

nemo_automodel.components.models.nemotron_omni.model.ModelClass = NemotronOmniForConditionalGeneration

nemo_automodel.components.models.nemotron_omni.model.logger = logging.getLogger(__name__)