nemo_automodel.components.models.nemotron_omni.model


NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) custom model for Nemo Automodel.

This model is a VLM (vision-language model) with:

  • Vision encoder: RADIO v2.5-H (ViT-Huge, patch_size=16) — loaded from HF
  • Audio encoder: Parakeet (FastConformer-based) — loaded from HF
  • LLM: NemotronH (hybrid Mamba+Attention MoE) — reuses nemotron_v3 custom implementation
  • Projectors: MLP projectors for vision->LLM and audio->LLM

Architecture name: “NemotronH_Nano_Omni_Reasoning_V3” (from config.json)

Module Contents

Classes

Name | Description
--- | ---
NemotronOmniConfig | Configuration for the NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) model.
NemotronOmniForConditionalGeneration | NemotronOmni VLM model for conditional generation (training).
RMSNorm | Root Mean Square Layer Normalization.
SoundProjection | MLP projector from sound encoder to LLM hidden space.
SquaredReLU | Squared ReLU activation: ReLU(x)^2.
VisionProjector | MLP projector from vision encoder to LLM hidden space.
_ModelProxy | Thin proxy so the MoE parallelizer can navigate model.model.moe_config

Data

ModelClass

logger

API

class nemo_automodel.components.models.nemotron_omni.model.NemotronOmniConfig(
vision_config = None,
llm_config = None,
sound_config = None,
force_image_size = 512,
downsample_ratio = 0.5,
patch_size = 16,
template = None,
ps_version = 'v2',
image_tag_type = 'internvl',
projector_hidden_size = 20480,
vit_hidden_size = 1280,
img_context_token_id = 18,
video_context_token_id = 131081,
sound_context_token_id = 27,
video_pruning_rate = 0.7,
kwargs = {}
)

Bases: PretrainedConfig

Configuration for the NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) model.

This wraps the HF config and provides easy access to sub-configs.
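As a rough illustration of how the image-related defaults above interact, the sketch below computes a per-tile token count with the InternVL-style formula. The helper name `num_image_tokens` and the formula itself are assumptions; the source lists a `num_image_token` attribute but does not state how it is derived.

```python
def num_image_tokens(force_image_size: int = 512,
                     patch_size: int = 16,
                     downsample_ratio: float = 0.5) -> int:
    """Tokens per tile after patching and pixel-shuffle downsampling (assumed formula)."""
    patches_per_side = force_image_size // patch_size  # 512 // 16 = 32
    # Pixel shuffle shrinks each spatial dim by downsample_ratio,
    # so the token count scales by downsample_ratio ** 2.
    return int(patches_per_side ** 2 * downsample_ratio ** 2)


print(num_image_tokens())  # 256 with the defaults shown above
```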

model_type
= 'NemotronH_Nano_Omni_Reasoning_V3'
class nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs = {}
)

Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin

NemotronOmni VLM model for conditional generation (training).

Wraps:

  • Vision encoder (RADIO v2.5-H) — HF implementation via trust_remote_code
  • Audio encoder (Parakeet) — HF implementation via trust_remote_code
  • Vision projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)
  • Sound projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)
  • Language model (NemotronH hybrid Mamba+Attention MoE) — nemotron_v3 custom impl

The LLM part reuses the nemotron_v3 implementation (NemotronHForCausalLM) which has custom DTensor parallelism for the Mamba+Attention hybrid MoE architecture.

backend
= backend or BackendConfig()
downsample_ratio
= getattr(config, 'downsample_ratio', 0.5)
force_image_size
= getattr(config, 'force_image_size', 512)
img_context_token_id
= getattr(config, 'img_context_token_id', 18)
language_model
model
= _ModelProxy(self.language_model)
num_image_token
patch_size
= getattr(config, 'patch_size', 16)
ps_version
= getattr(config, 'ps_version', 'v2')
sound_context_token_id
= getattr(config, 'sound_context_token_id', 27)
sound_encoder
= ParakeetEncoder(parakeet_config).to(dtype)
sound_projection
state_dict_adapter
video_context_token_id
= getattr(config, 'video_context_token_id', 131081)
video_temporal_patch_dim
= getattr(config, 'video_temporal_patch_size', None)
vision_model
= self.vision_model.to(dtype)
vision_projector
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration._make_missing_buffers_non_persistent(
module: torch.nn.Module
) -> None
staticmethod

Convert persistent buffers that are NOT saved in HF checkpoints to non-persistent buffers.

The RADIO vision encoder registers some buffers (e.g. summary_idxs) as persistent, but the HF checkpoint does not contain them. When the DCP loader builds its load plan it expects every persistent buffer to appear in the checkpoint and raises RuntimeError: Missing key otherwise.

This method re-registers such buffers as non-persistent so they are kept at their init-time values and not expected on disk.
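A minimal sketch of the re-registration idea, assuming the set of missing buffer names is known up front. The function name `make_buffers_non_persistent`, the `missing` argument, and `ToyEncoder` are hypothetical and not part of the module; only `register_buffer(..., persistent=False)` is standard PyTorch.

```python
import torch
from torch import nn


def make_buffers_non_persistent(module: nn.Module, missing: set) -> None:
    """Re-register the named buffers as non-persistent, keeping their current values."""
    for submodule in module.modules():
        # Materialize the list first: register_buffer mutates the buffer dict.
        for name, buf in list(submodule.named_buffers(recurse=False)):
            if name in missing:
                submodule.register_buffer(name, buf, persistent=False)


class ToyEncoder(nn.Module):  # hypothetical stand-in for the vision encoder
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        self.register_buffer("summary_idxs", torch.tensor([0, 1, 2]))


encoder = ToyEncoder()
keys_before = set(encoder.state_dict().keys())
make_buffers_non_persistent(encoder, {"summary_idxs"})
keys_after = set(encoder.state_dict().keys())

# The buffer leaves the state dict but keeps its init-time value on the module.
print("summary_idxs" in keys_before, "summary_idxs" in keys_after)  # True False
```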

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration._pixel_shuffle_dynamic_res(
x: torch.Tensor,
imgs_sizes: list[tuple[int, int]]
) -> torch.Tensor

Per-image pixel-shuffle for dynamic-resolution outputs.

Ported from vLLM’s NanoNemotronVLMultimodal.pixel_shuffle_dynamic_res. Splits x along the sequence dim by per-image patch counts, reshapes each split to (N, H_patches, W_patches, C_feat), applies pixel_shuffle with downsample_ratio, and flattens back to a concatenated (N, L’, C).
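The split/reshape/shuffle/flatten sequence described above might look roughly like the sketch below. It assumes `x` has shape (1, total_patches, C), that `imgs_sizes` holds pixel sizes divisible by `patch_size`, and that each patch grid has even height and width; the standalone helper `_shuffle` uses an InternVL-style formulation and is not taken from the source.

```python
import torch


def _shuffle(x: torch.Tensor, s: float) -> torch.Tensor:
    """InternVL-style pixel shuffle on a (N, A, B, C) grid (assumed formulation)."""
    n, a, b, c = x.shape
    x = x.reshape(n, a, int(b * s), int(c / s)).permute(0, 2, 1, 3).contiguous()
    x = x.reshape(n, int(b * s), int(a * s), int(c / (s * s)))
    return x.permute(0, 2, 1, 3).contiguous()


def pixel_shuffle_dynamic_res(x, imgs_sizes, patch_size=16, downsample_ratio=0.5):
    # Number of ViT patches contributed by each image.
    counts = [(h // patch_size) * (w // patch_size) for h, w in imgs_sizes]
    outputs = []
    for feats, (h, w) in zip(torch.split(x, counts, dim=1), imgs_sizes):
        n, _, c = feats.shape
        grid = feats.reshape(n, h // patch_size, w // patch_size, c)
        shuffled = _shuffle(grid, downsample_ratio)
        # Flatten the spatial grid back into a token sequence.
        outputs.append(shuffled.reshape(n, -1, shuffled.shape[-1]))
    return torch.cat(outputs, dim=1)


sizes = [(32, 64), (64, 64)]          # 2x4 and 4x4 patch grids -> 8 + 16 patches
feats = torch.randn(1, 24, 8)
out = pixel_shuffle_dynamic_res(feats, sizes)
print(out.shape)  # torch.Size([1, 6, 32])
```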

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.extract_feature(
pixel_values: torch.Tensor
) -> torch.Tensor

Extract vision features from pixel values through RADIO + projector.

Parameters:

pixel_values
torch.Tensor

Image tensors [num_tiles, C, H, W]

Returns: torch.Tensor

Vision embeddings [num_tiles, num_tokens, llm_hidden_size]

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.extract_feature_dynamic(
pixel_values: torch.Tensor,
imgs_sizes: torch.Tensor | list[tuple[int, int]]
) -> torch.Tensor

Dynamic-resolution feature extraction (no tile splitting).

Matches vLLM’s dynamic-resolution vision path for Nano v3 VL / Nemotron-Omni (see 3rdparty/vllm/vllm/model_executor/models/nano_nemotron_vl.py). Required when the rollout uses DynamicResolutionImageTiler: the tile-based extract_feature would produce different embeddings and break rollout/train logprob agreement.

Unlike vLLM’s RADIO port (which supports packed imgs_sizes= inputs), the HF RADIO from nvidia/C-RADIOv2-H only accepts a dense (B, C, H, W) tensor. We crop each padded image back to its real size and run the vision model per-image, then concatenate features.
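The crop step can be pictured with the sketch below. It assumes padding was added at the bottom and right edges, so the real image sits in the top-left corner; the source does not state the padding layout, and `crop_to_real_sizes` is a hypothetical helper name.

```python
import torch


def crop_to_real_sizes(pixel_values: torch.Tensor, imgs_sizes) -> list:
    """Return one (1, C, h, w) tensor per image, with padding removed."""
    if isinstance(imgs_sizes, torch.Tensor):
        imgs_sizes = imgs_sizes.tolist()
    # Slice i:i+1 keeps the batch dim so each crop is a dense (B=1, C, H, W) input.
    return [pixel_values[i : i + 1, :, :h, :w] for i, (h, w) in enumerate(imgs_sizes)]


batch = torch.zeros(2, 3, 64, 96)            # padded to the batch max (h, w)
sizes = torch.tensor([[64, 48], [32, 96]])   # actual (h, w) per image
crops = crop_to_real_sizes(batch, sizes)
print([tuple(c.shape) for c in crops])  # [(1, 3, 64, 48), (1, 3, 32, 96)]
```

Each crop would then go through the vision model on its own, and the per-image features would be concatenated along the sequence dimension.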

Parameters:

pixel_values
torch.Tensor

[num_images, C, H_padded, W_padded] batch of dynamically-resized images padded to the batch max (h, w).

imgs_sizes
torch.Tensor | list[tuple[int, int]]

[num_images, 2] actual (h, w) per image (torch tensor of ints) or an equivalent list of tuples.

Returns: torch.Tensor

Vision embeddings [sum_num_embeddings_after_pixel_shuffle,

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.extract_sound_feature(
input_features: torch.Tensor,
attention_mask: typing.Optional[torch.Tensor] = None
) -> torch.Tensor

Extract and project sound features from audio input.

Parameters:

input_features
torch.Tensor

Mel spectrogram features [batch, seq_len, feature_dim]

attention_mask
Optional[torch.Tensor], defaults to None

Optional attention mask [batch, seq_len]

Returns: torch.Tensor

Sound embeddings projected to LLM hidden size

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.extract_video_feature(
pixel_values_videos: torch.Tensor
) -> torch.Tensor

Pack T = video_temporal_patch_dim frames into channels and run the ViT.

Returns embeddings shaped like extract_feature output, but with ceil(N_frames / T) rows instead of one row per frame.
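A small sketch of the frame-packing arithmetic, assuming frames arrive as (N_frames, C, H, W) and that a short final group is padded by repeating the last frame. The padding strategy and the name `pack_frames` are assumptions; the source only states that T frames are packed into channels and that the result has ceil(N_frames / T) rows.

```python
import math

import torch


def pack_frames(frames: torch.Tensor, t: int) -> torch.Tensor:
    """Stack every t consecutive frames along the channel dimension."""
    n, c, h, w = frames.shape
    groups = math.ceil(n / t)
    pad = groups * t - n
    if pad:
        # Assumed padding: repeat the last frame to fill the final group.
        frames = torch.cat([frames, frames[-1:].expand(pad, -1, -1, -1)], dim=0)
    return frames.reshape(groups, t * c, h, w)


video = torch.randn(5, 3, 4, 4)
packed = pack_frames(video, t=2)
print(packed.shape)  # torch.Size([3, 6, 4, 4])
```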

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.forward(
pixel_values: typing.Optional[torch.FloatTensor] = None,
input_ids: typing.Optional[torch.LongTensor] = None,
attention_mask: typing.Optional[torch.Tensor] = None,
position_ids: typing.Optional[torch.LongTensor] = None,
image_flags: typing.Optional[torch.LongTensor] = None,
imgs_sizes: typing.Optional[torch.LongTensor] = None,
past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None,
labels: typing.Optional[torch.LongTensor] = None,
sound_features: typing.Optional[torch.FloatTensor] = None,
sound_attention_mask: typing.Optional[torch.Tensor] = None,
pixel_values_videos: typing.Optional[torch.FloatTensor] = None,
inputs_embeds: typing.Optional[torch.FloatTensor] = None,
use_cache: typing.Optional[bool] = None,
output_attentions: typing.Optional[bool] = None,
output_hidden_states: typing.Optional[bool] = None,
return_dict: typing.Optional[bool] = None,
logits_to_keep: typing.Union[int, torch.Tensor] = 0,
_pre_embed_only: bool = False,
kwargs = {}
) -> typing.Union[dict, typing.Tuple, transformers.modeling_outputs.CausalLMOutputWithPast]

Forward pass for training.

This follows the same pattern as the HF NemotronH_Nano_Omni_Reasoning_V3.forward():

  1. Get text embeddings from LLM embed_tokens
  2. Extract vision features from pixel_values
  3. Replace image token embeddings with vision embeddings
  4. Run LLM forward pass
  5. Compute loss if labels provided
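Step 3 above can be sketched as a masked assignment. This is an illustrative version under the assumption that the number of image placeholder tokens equals the number of vision embedding rows; `merge_vision_embeds` is a hypothetical name, and the default token id 18 is taken from the config defaults.

```python
import torch


def merge_vision_embeds(input_ids, text_embeds, vision_embeds, img_context_token_id=18):
    """Overwrite embeddings at image-placeholder positions with vision embeddings."""
    b, s, d = text_embeds.shape
    flat = text_embeds.reshape(b * s, d).clone()
    mask = input_ids.reshape(-1) == img_context_token_id
    # Rows are filled in sequence order, so vision rows must be ordered the same way.
    flat[mask] = vision_embeds.reshape(-1, d).to(flat.dtype)
    return flat.reshape(b, s, d)


ids = torch.tensor([[1, 18, 18, 2]])
text = torch.zeros(1, 4, 3)
vision = torch.ones(1, 2, 3)
merged = merge_vision_embeds(ids, text, vision)
print(merged[0, :, 0])  # tensor([0., 1., 1., 0.])
```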

Parameters:

pixel_values
Optional[torch.FloatTensor], defaults to None

Image pixel values [num_tiles, C, H, W]

input_ids
Optional[torch.LongTensor], defaults to None

Input token IDs [batch, seq_len]

attention_mask
Optional[torch.Tensor], defaults to None

Attention mask [batch, seq_len]

position_ids
Optional[torch.LongTensor], defaults to None

Position IDs (unused, for API compat)

image_flags
Optional[torch.LongTensor], defaults to None

Flags indicating real images vs padding [num_tiles, 1]

labels
Optional[torch.LongTensor], defaults to None

Token IDs for loss computation [batch, seq_len]

inputs_embeds
Optional[torch.FloatTensor], defaults to None

Pre-computed input embeddings (optional)

use_cache
Optional[bool], defaults to None

Whether to use caching (not used in training)

output_hidden_states
Optional[bool], defaults to None

Whether the returned output should carry the final decoder hidden states (required for fused linear cross-entropy / cut-CE). Defaults to the text sub-config’s output_hidden_states when None.

logits_to_keep
Union[int, torch.Tensor], defaults to 0

If 0 (default), compute logits for all positions; if > 0, only compute logits for the last logits_to_keep positions (used by fused linear cross-entropy to avoid the full logit matrix). Forwarded to the language-model lm_head gating.

**kwargs
Defaults to {}

Additional arguments

Returns: Union[dict, Tuple, CausalLMOutputWithPast]

CausalLMOutputWithPast with loss and logits
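The logits_to_keep behaviour described in the parameter list can be sketched as a slice over the sequence dimension. This follows the convention commonly used in Hugging Face causal LMs and is an assumption about this model's internals; `keep_last_positions` is a hypothetical helper.

```python
import torch


def keep_last_positions(hidden_states: torch.Tensor, logits_to_keep=0) -> torch.Tensor:
    """Select the positions whose logits would be computed by the lm_head."""
    if isinstance(logits_to_keep, int):
        # slice(-0, None) is slice(0, None), so 0 keeps every position.
        index = slice(-logits_to_keep, None)
    else:
        index = logits_to_keep  # explicit tensor of positions
    return hidden_states[:, index, :]


hidden = torch.randn(2, 5, 4)
print(keep_last_positions(hidden).shape)     # torch.Size([2, 5, 4])
print(keep_last_positions(hidden, 2).shape)  # torch.Size([2, 2, 4])
```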

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.from_config(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs = {}
)
classmethod

Create model from config.

Parameters:

config

NemotronH_Nano_Omni_Reasoning_V3 config (HF config with trust_remote_code)

backend
BackendConfig | None, defaults to None

Backend configuration

**kwargs
Defaults to {}

Additional arguments

Returns:

NemotronOmniForConditionalGeneration instance

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.from_pretrained(
pretrained_model_name_or_path: str,
model_args = (),
kwargs = {}
)
classmethod

Load pretrained model.

Parameters:

pretrained_model_name_or_path
str

Path or name of pretrained model

*model_args
Defaults to ()

Additional positional arguments

**kwargs
Defaults to {}

Additional keyword arguments

Returns:

NemotronOmniForConditionalGeneration instance

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.get_input_embeddings()

Return the input embeddings from the language model.

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.get_output_embeddings()

Return the output embeddings (lm_head) from the language model.

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16
) -> None

Initialize model weights.

Parameters:

buffer_device
torch.device | None, defaults to None

Device to use for buffer initialization

dtype
torch.dtype, defaults to torch.bfloat16

Target dtype for model weights

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.pixel_shuffle(
x: torch.Tensor,
scale_factor: float = 0.5
) -> torch.Tensor

Pixel shuffle for downsampling spatial resolution while increasing channels.

Parameters:

x
torch.Tensor

Input tensor [N, W, H, C]

scale_factor
float, defaults to 0.5

Downsampling ratio (default 0.5 = halve spatial dims)

Returns: torch.Tensor

Shuffled tensor [N, W*scale, H*scale, C/(scale^2)]
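For intuition about the shape change, here is a standalone sketch using the InternVL-style "v2" formulation. It is assumed, not confirmed by the source, that the method follows this variant; the sketch also requires W and H to be divisible by 1/scale_factor.

```python
import torch


def pixel_shuffle(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    n, w, h, c = x.shape
    # Fold neighbouring H positions into channels: [N, W, H*s, C/s]
    x = x.reshape(n, w, int(h * scale_factor), int(c / scale_factor))
    # Swap the spatial axes: [N, H*s, W, C/s]
    x = x.permute(0, 2, 1, 3).contiguous()
    # Fold neighbouring W positions into channels: [N, H*s, W*s, C/s^2]
    x = x.reshape(n, int(h * scale_factor), int(w * scale_factor),
                  int(c / (scale_factor * scale_factor)))
    # Swap back to the input axis order: [N, W*s, H*s, C/s^2]
    return x.permute(0, 2, 1, 3).contiguous()


grid = torch.randn(2, 4, 6, 8)
shuffled = pixel_shuffle(grid)
print(shuffled.shape)  # torch.Size([2, 2, 3, 32])
```

With the default ratio of 0.5, the token count drops by 4x and the channel width grows by 4x, so the total number of elements is unchanged.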

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.prepare_inputs_embeds_for_cp(
input_ids: torch.Tensor,
pixel_values: typing.Optional[torch.Tensor] = None,
image_flags: typing.Optional[torch.Tensor] = None,
imgs_sizes: typing.Optional[torch.Tensor] = None,
pixel_values_videos: typing.Optional[torch.Tensor] = None,
sound_features: typing.Optional[torch.Tensor] = None,
sound_attention_mask: typing.Optional[torch.Tensor] = None
) -> torch.Tensor

Thin wrapper returning just inputs_embeds for callers that don’t need the full prepared-inputs dict.

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.prepare_model_inputs_for_cp(
input_ids: torch.Tensor,
pixel_values: typing.Optional[torch.Tensor] = None,
image_flags: typing.Optional[torch.Tensor] = None,
imgs_sizes: typing.Optional[torch.Tensor] = None,
pixel_values_videos: typing.Optional[torch.Tensor] = None,
sound_features: typing.Optional[torch.Tensor] = None,
sound_attention_mask: typing.Optional[torch.Tensor] = None
) -> dict

Merge image/video/audio features into text embeddings BEFORE CP sharding.

Under CP > 1 the sequence is sharded; multimodal scatter must run on the full un-sharded sequence so each rank ends up with embeddings that match its local slice of input_ids. Returns a dict so future per-layer inputs can ride alongside inputs_embeds.
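The ordering constraint above (scatter first, shard second) can be illustrated with a toy example. Contiguous chunking along the sequence dimension is a simplification chosen for the sketch; real context-parallel layouts may be load-balanced differently, and `merge_then_shard` is a hypothetical name.

```python
import torch


def merge_then_shard(input_ids, text_embeds, mm_embeds, placeholder_id, cp_size):
    """Scatter multimodal rows on the full sequence, then split it across CP ranks."""
    merged = text_embeds.clone()
    # Boolean mask over (batch, seq) selects placeholder rows in sequence order.
    merged[input_ids == placeholder_id] = mm_embeds.to(merged.dtype)
    return list(torch.chunk(merged, cp_size, dim=1))


ids = torch.tensor([[5, 18, 18, 6, 18, 7]])
text = torch.zeros(1, 6, 2)
mm = torch.tensor([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
shards = merge_then_shard(ids, text, mm, placeholder_id=18, cp_size=2)

# Rank 1 holds positions 3..5 and receives the third multimodal row at position 4.
print(shards[1][0, 1])  # tensor([3., 3.])
```

If the scatter ran after sharding, each rank would need to know how many placeholder tokens preceded its slice to pick the right multimodal rows, which is what merging on the full sequence avoids.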

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.set_input_embeddings(
value
)

Set the input embeddings of the language model.

nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.set_output_embeddings(
new_embeddings
)

Set the output embeddings (lm_head) of the language model.

class nemo_automodel.components.models.nemotron_omni.model.RMSNorm(
hidden_size: int,
eps: float = 1e-05
)

Bases: Module

Root Mean Square Layer Normalization.

weight
= nn.Parameter(torch.ones(hidden_size))
nemo_automodel.components.models.nemotron_omni.model.RMSNorm.forward(
hidden_states: torch.Tensor
) -> torch.Tensor
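
For reference, a common way to write RMS normalization is sketched below. The float32 upcast and the exact placement of the weight multiply are assumptions borrowed from typical implementations, not details confirmed by the source; `RMSNormSketch` is a hypothetical name.

```python
import torch
from torch import nn


class RMSNormSketch(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        in_dtype = hidden_states.dtype
        x = hidden_states.to(torch.float32)  # normalize in float32 for stability
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x.to(in_dtype)


norm = RMSNormSketch(8)
normed = norm(torch.randn(4, 8) * 3.0)
# With unit weights, each row of the output has root-mean-square close to 1.
row_rms = normed.pow(2).mean(dim=-1).sqrt()
```
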
class nemo_automodel.components.models.nemotron_omni.model.SoundProjection(
sound_hidden_size: int,
projection_hidden_size: int,
llm_hidden_size: int,
bias: bool = False
)

Bases: Module

MLP projector from sound encoder to LLM hidden space.

activation
= SquaredReLU()
linear1
linear2
norm
= RMSNorm(sound_hidden_size, eps=1e-05)
nemo_automodel.components.models.nemotron_omni.model.SoundProjection.forward(
x: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.nemotron_omni.model.SquaredReLU()

Bases: Module

Squared ReLU activation: ReLU(x)^2.

nemo_automodel.components.models.nemotron_omni.model.SquaredReLU.forward(
x: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.nemotron_omni.model.VisionProjector(
vit_hidden_size: int,
projector_hidden_size: int,
llm_hidden_size: int,
downsample_ratio: float = 0.5
)

Bases: Module

MLP projector from vision encoder to LLM hidden space.

HF checkpoint structure (mlp1):

  • mlp1.0.weight -> RMSNorm weight (vit_hidden_size * pixel_shuffle_factor^2,)
  • mlp1.1.weight -> Linear1 weight (projector_hidden_size, vit_hidden_size * pixel_shuffle_factor^2)
  • mlp1.3.weight -> Linear2 weight (llm_hidden_size, projector_hidden_size)

Between linear1 and linear2 there is a SquaredReLU activation (index 2 in Sequential, but it has no weight).
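The index layout described above can be reproduced with a toy `nn.Sequential`. The sketch uses small placeholder sizes, assumes bias-free linear layers (only weights are listed in the checkpoint structure), and assumes pixel_shuffle_factor equals 1 / downsample_ratio; `TinyRMSNorm` is a minimal stand-in, not the module's RMSNorm.

```python
import torch
from torch import nn


class SquaredReLU(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) ** 2


class TinyRMSNorm(nn.Module):  # minimal stand-in with a single weight vector
    def __init__(self, size: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


vit_hidden_size, projector_hidden_size, llm_hidden_size = 8, 16, 12  # toy sizes
downsample_ratio = 0.5
pixel_shuffle_channels = vit_hidden_size * int(1 / downsample_ratio) ** 2  # 8 * 4

mlp1 = nn.Sequential(
    TinyRMSNorm(pixel_shuffle_channels),                                   # index 0
    nn.Linear(pixel_shuffle_channels, projector_hidden_size, bias=False),  # index 1
    SquaredReLU(),                                                         # index 2, no weight
    nn.Linear(projector_hidden_size, llm_hidden_size, bias=False),         # index 3
)

shapes = {name: tuple(p.shape) for name, p in mlp1.state_dict().items()}
print(shapes)  # {'0.weight': (32,), '1.weight': (16, 32), '3.weight': (12, 16)}
```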

activation
= SquaredReLU()
linear1
linear2
norm
= RMSNorm(pixel_shuffle_channels, eps=1e-05)
nemo_automodel.components.models.nemotron_omni.model.VisionProjector.forward(
x: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.nemotron_omni.model._ModelProxy(
llm: nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM
)

Thin proxy so the MoE parallelizer can navigate model.model.moe_config and model.model -> get_text_module -> .layers without changing the weight hierarchy.

The parallelizer (parallelizer.py) expects:

  • model.model.moe_config (for expert-count validation)
  • model.model -> get_text_module() (finds language_model attr) -> .layers

By setting self.model = _ModelProxy(self.language_model) on the VLM:

  • model.model.moe_config -> language_model.model.moe_config (OK)
  • get_text_module(model.model) -> model.model.language_model == language_model.model (NemotronV3Model) -> .layers (OK)

language_model
= llm.model
moe_config
= llm.model.moe_config
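
The attribute navigation described above can be mimicked with plain objects. Everything in this sketch is a stand-in: `ModelProxySketch` mirrors the two documented attributes, `get_text_module` is a simplified guess at the parallelizer helper, and the `moe_config` contents are placeholder data.

```python
from types import SimpleNamespace


class ModelProxySketch:
    """Expose the inner model and its MoE config without owning any weights."""

    def __init__(self, llm):
        self.language_model = llm.model
        self.moe_config = llm.model.moe_config


def get_text_module(model):
    # Simplified stand-in: follow a language_model attribute if one exists.
    return getattr(model, "language_model", model)


inner = SimpleNamespace(moe_config={"placeholder": True}, layers=["layer0", "layer1"])
llm = SimpleNamespace(model=inner)          # plays the role of NemotronHForCausalLM
vlm = SimpleNamespace(language_model=llm, model=ModelProxySketch(llm))

print(vlm.model.moe_config is inner.moe_config)  # True
print(get_text_module(vlm.model).layers)         # ['layer0', 'layer1']
```
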
nemo_automodel.components.models.nemotron_omni.model.ModelClass = NemotronOmniForConditionalGeneration
nemo_automodel.components.models.nemotron_omni.model.logger = logging.getLogger(__name__)