`nemo_automodel.components.models.nemotron_omni.model`#

NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) custom model for Nemo Automodel.

This model is a VLM (vision-language model) with:

Vision encoder: RADIO v2.5-H (ViT-Huge, patch_size=16) – loaded from HF
Audio encoder: Parakeet (FastConformer-based) – loaded from HF
LLM: NemotronH (hybrid Mamba+Attention MoE) – reuses nemotron_v3 custom implementation
Projectors: MLP projectors for vision->LLM and audio->LLM

Architecture name: “NemotronH_Nano_Omni_Reasoning_V3” (from config.json)

Module Contents#

Classes#

`SquaredReLU`	Squared ReLU activation: ReLU(x)^2.
`RMSNorm`	Root Mean Square Layer Normalization.
`VisionProjector`	MLP projector from vision encoder to LLM hidden space.
`SoundProjection`	MLP projector from sound encoder to LLM hidden space.
`NemotronOmniConfig`	Configuration for the NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) model.
`_ModelProxy`	Thin proxy so the MoE parallelizer can navigate model.model.moe_config and model.model -> get_text_module -> .layers without changing the weight hierarchy.
`NemotronOmniForConditionalGeneration`	NemotronOmni VLM model for conditional generation (training).

Data#

`logger`
`ModelClass`

API#

nemo_automodel.components.models.nemotron_omni.model.logger#: ‘getLogger(…)’

class nemo_automodel.components.models.nemotron_omni.model.SquaredReLU#

Bases: torch.nn.Module

Squared ReLU activation: ReLU(x)^2.

forward(x: torch.Tensor) → torch.Tensor#
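As a quick illustration of the formula, the sketch below applies ReLU(x)^2 to plain Python floats. It is a reference for the math only: `squared_relu` is a hypothetical helper name, and the actual module operates on torch tensors.

```python
def squared_relu(values: list[float]) -> list[float]:
    """Apply ReLU(x)^2 elementwise: negatives map to 0, positives are squared."""
    return [max(v, 0.0) ** 2 for v in values]


print(squared_relu([-2.0, 0.0, 1.5, 3.0]))  # [0.0, 0.0, 2.25, 9.0]
```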

class nemo_automodel.components.models.nemotron_omni.model.RMSNorm(hidden_size: int, eps: float = 1e-05)#

Bases: torch.nn.Module

Root Mean Square Layer Normalization.

Initialization

forward(hidden_states: torch.Tensor) → torch.Tensor#
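The normalization can be sketched in plain Python, assuming the module follows the standard RMSNorm formulation (divide by the root mean square over the hidden dimension, then scale by a learned weight). `rms_norm` is a hypothetical helper; dtype handling and other details of the real module are not documented here.

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-5) -> list[float]:
    """Scale x by the reciprocal of its root mean square, then by a per-element weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


# With unit weights and eps=0, the output has a mean square of 1.
out = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
print(out)
```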

class nemo_automodel.components.models.nemotron_omni.model.VisionProjector(vit_hidden_size: int, projector_hidden_size: int, llm_hidden_size: int, downsample_ratio: float = 0.5)#

Bases: torch.nn.Module

MLP projector from vision encoder to LLM hidden space.

HF checkpoint structure (mlp1):

mlp1.0.weight -> RMSNorm weight (vit_hidden_size * pixel_shuffle_factor^2,)
mlp1.1.weight -> Linear1 weight (projector_hidden_size, vit_hidden_size * pixel_shuffle_factor^2)
mlp1.3.weight -> Linear2 weight (llm_hidden_size, projector_hidden_size)

Between linear1 and linear2 there is a SquaredReLU activation (index 2 in Sequential, but it has no weight).
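To make the checkpoint shapes concrete, here is a minimal sketch that computes them from the constructor arguments. It assumes `pixel_shuffle_factor` is the reciprocal of `downsample_ratio` (so 2 for the default 0.5), which this page does not state explicitly. `vision_projector_shapes` is a hypothetical helper, and `llm_hidden_size=4096` is a placeholder value, not taken from the model config.

```python
def vision_projector_shapes(vit_hidden_size, projector_hidden_size,
                            llm_hidden_size, downsample_ratio=0.5):
    """Return the expected weight shapes for the mlp1.* checkpoint keys."""
    # Assumption: pixel_shuffle_factor == 1 / downsample_ratio.
    factor = round(1 / downsample_ratio)
    in_features = vit_hidden_size * factor ** 2
    return {
        "mlp1.0.weight": (in_features,),                         # RMSNorm
        "mlp1.1.weight": (projector_hidden_size, in_features),   # Linear1
        # Index 2 is SquaredReLU and has no weight.
        "mlp1.3.weight": (llm_hidden_size, projector_hidden_size),  # Linear2
    }


shapes = vision_projector_shapes(
    vit_hidden_size=1280,          # NemotronOmniConfig default
    projector_hidden_size=20480,   # NemotronOmniConfig default
    llm_hidden_size=4096,          # placeholder, depends on the LLM config
)
for key, shape in shapes.items():
    print(key, shape)
```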

Initialization

forward(x: torch.Tensor) → torch.Tensor#

class nemo_automodel.components.models.nemotron_omni.model.SoundProjection(sound_hidden_size: int, projection_hidden_size: int, llm_hidden_size: int, bias: bool = False)#

Bases: torch.nn.Module

MLP projector from sound encoder to LLM hidden space.

HF checkpoint structure:

sound_projection.norm.weight -> RMSNorm weight (sound_hidden_size,)
sound_projection.linear1.weight -> Linear1 weight (projection_hidden_size, sound_hidden_size)
sound_projection.linear2.weight -> Linear2 weight (llm_hidden_size, projection_hidden_size)

Initialization

forward(x: torch.Tensor) → torch.Tensor#

class nemo_automodel.components.models.nemotron_omni.model.NemotronOmniConfig(vision_config=None, llm_config=None, sound_config=None, force_image_size=512, downsample_ratio=0.5, patch_size=16, template=None, ps_version='v2', image_tag_type='internvl', projector_hidden_size=20480, vit_hidden_size=1280, img_context_token_id=18, video_context_token_id=131081, sound_context_token_id=27, video_pruning_rate=0.7, **kwargs)#

Bases: transformers.configuration_utils.PretrainedConfig

Configuration for the NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) model.

This wraps the HF config and provides easy access to sub-configs.

Initialization

model_type#: ‘NemotronH_Nano_Omni_Reasoning_V3’

is_composition#: True

class nemo_automodel.components.models.nemotron_omni.model._ModelProxy(llm: nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM)#

Thin proxy so the MoE parallelizer can navigate model.model.moe_config and model.model -> get_text_module -> .layers without changing the weight hierarchy.

The parallelizer (parallelizer.py) expects:

model.model.moe_config (for expert-count validation)
model.model -> get_text_module() (finds language_model attr) -> .layers

By setting self.model = _ModelProxy(self.language_model) on the VLM:

model.model.moe_config -> language_model.model.moe_config OK
get_text_module(model.model) -> model.model.language_model == language_model.model (NemotronV3Model) -> .layers OK
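The attribute navigation described above can be sketched with plain Python stand-ins. Everything here is hypothetical: `ModelProxySketch` and `get_text_module_sketch` only mimic the lookups listed above, the `SimpleNamespace` objects replace the real model classes, and the `moe_config` contents are placeholder values.

```python
from types import SimpleNamespace


class ModelProxySketch:
    """Hypothetical stand-in for _ModelProxy: re-exposes the inner model, owns no weights."""

    def __init__(self, llm):
        # llm plays the role of NemotronHForCausalLM; llm.model is the inner text model.
        self.language_model = llm.model

    @property
    def moe_config(self):
        return self.language_model.moe_config


def get_text_module_sketch(module):
    """Simplified stand-in for get_text_module: follow language_model if present."""
    return getattr(module, "language_model", module)


# Placeholder objects standing in for the real modules.
inner = SimpleNamespace(moe_config={"num_experts": 8}, layers=["layer0", "layer1"])
llm = SimpleNamespace(model=inner)
vlm = SimpleNamespace(language_model=llm, model=ModelProxySketch(llm))

print(vlm.model.moe_config)                       # same object as inner.moe_config
print(get_text_module_sketch(vlm.model).layers)   # the inner model's layers
```

Because the proxy only holds references, both lookup paths resolve to the same objects that already live under `language_model`.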

Initialization

class nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration(config, backend: nemo_automodel.components.models.common.BackendConfig | None = None, **kwargs)#

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin

NemotronOmni VLM model for conditional generation (training).

Wraps:

Vision encoder (RADIO v2.5-H) – HF implementation via trust_remote_code
Audio encoder (Parakeet) – HF implementation via trust_remote_code
Vision projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)
Sound projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)
Language model (NemotronH hybrid Mamba+Attention MoE) – nemotron_v3 custom impl

The LLM part reuses the nemotron_v3 implementation (NemotronHForCausalLM) which has custom DTensor parallelism for the Mamba+Attention hybrid MoE architecture.

Initialization

Initialize NemotronOmniForConditionalGeneration.

Parameters:

config – NemotronH_Nano_Omni_Reasoning_V3 config
backend – Backend configuration
**kwargs – Additional arguments

classmethod from_config(config, backend: nemo_automodel.components.models.common.BackendConfig | None = None, **kwargs)#

Create model from config.

Parameters:

config – NemotronH_Nano_Omni_Reasoning_V3 config (HF config with trust_remote_code)
backend – Backend configuration
**kwargs – Additional arguments

Returns:

NemotronOmniForConditionalGeneration instance

classmethod from_pretrained(pretrained_model_name_or_path: str, *model_args, **kwargs)#

Load pretrained model.

Parameters:

pretrained_model_name_or_path – Path or name of pretrained model
*model_args – Additional positional arguments
**kwargs – Additional keyword arguments

Returns:

NemotronOmniForConditionalGeneration instance

static _make_missing_buffers_non_persistent(module: torch.nn.Module) → None#

Convert persistent buffers that are NOT saved in HF checkpoints to non-persistent buffers.

The RADIO vision encoder registers some buffers (e.g. summary_idxs) as persistent, but the HF checkpoint does not contain them. When the DCP loader builds its load plan it expects every persistent buffer to appear in the checkpoint and raises RuntimeError: Missing key otherwise.

This method re-registers such buffers as non-persistent so they are kept at their init-time values and not expected on disk.

get_input_embeddings()#: Return the input embeddings from the language model.

set_input_embeddings(value)#: Set the input embeddings of the language model.

get_output_embeddings()#: Return the output embeddings (lm_head) from the language model.

set_output_embeddings(new_embeddings)#: Set the output embeddings (lm_head) of the language model.

pixel_shuffle(x: torch.Tensor, scale_factor: float = 0.5) → torch.Tensor#

Pixel shuffle for downsampling spatial resolution while increasing channels.

Parameters:

x – Input tensor [N, W, H, C]
scale_factor – Downsampling ratio (default 0.5 = halve spatial dims)

Returns:

Shuffled tensor [N, W * scale, H * scale, C / (scale^2)]
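The shape bookkeeping of the pixel shuffle can be checked with a few lines of plain Python. This sketch only computes the output shape (spatial dims multiplied by the scale, channels divided by the scale squared); it does not rearrange any data. `pixel_shuffle_shape` is a hypothetical helper.

```python
def pixel_shuffle_shape(shape, scale_factor=0.5):
    """Map an input shape [N, W, H, C] to the pixel-shuffled output shape."""
    n, w, h, c = shape
    return (
        n,
        int(w * scale_factor),
        int(h * scale_factor),
        int(c / (scale_factor ** 2)),
    )


# A 32x32 patch grid with 1280 channels becomes 16x16 with 5120 channels.
out_shape = pixel_shuffle_shape((1, 32, 32, 1280))
print(out_shape)  # (1, 16, 16, 5120)
```

The total number of elements is unchanged: spatial resolution is traded for channel width.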

extract_feature(pixel_values: torch.Tensor) → torch.Tensor#

Extract vision features from pixel values through RADIO + projector.

Parameters:

pixel_values – Image tensors [num_tiles, C, H, W]

Returns:

Vision embeddings [num_tiles, num_tokens, llm_hidden_size]
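As a rough guide to `num_tokens`, the sketch below estimates the tokens produced per tile from the config defaults. It assumes square tiles of `force_image_size`, that only patch tokens are kept (no summary or class tokens), and the InternVL-style formula shown in the code; none of this is confirmed by this page. `tokens_per_tile` is a hypothetical helper.

```python
def tokens_per_tile(image_size=512, patch_size=16, downsample_ratio=0.5):
    """Estimate tokens per tile after patching and pixel shuffle (assumed formula)."""
    patches_per_side = image_size // patch_size
    return int(patches_per_side ** 2 * downsample_ratio ** 2)


# Defaults from NemotronOmniConfig: 512 px tiles, 16 px patches, ratio 0.5.
print(tokens_per_tile())  # 256
```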

extract_feature_dynamic(pixel_values: torch.Tensor, imgs_sizes: torch.Tensor | list[tuple[int, int]]) → torch.Tensor#

Dynamic-resolution feature extraction (no tile splitting).

Matches vLLM’s dynamic-resolution vision path for Nano v3 VL / Nemotron-Omni (see 3rdparty/vllm/vllm/model_executor/models/nano_nemotron_vl.py). Required when the rollout uses DynamicResolutionImageTiler: tile-based extract_feature would produce different embeddings and break rollout/train logprob agreement.

Unlike vLLM’s RADIO port (which supports packed imgs_sizes= inputs), the HF RADIO from nvidia/C-RADIOv2-H only accepts a dense (B, C, H, W) tensor. We crop each padded image back to its real size and run the vision model per-image, then concatenate features.

Parameters:

pixel_values – [num_images, C, H_padded, W_padded] batch of dynamically-resized images padded to the batch max (h, w).
imgs_sizes – [num_images, 2] actual (h, w) per image (torch tensor of ints) or an equivalent list of tuples.

Returns:

Vision embeddings [sum_num_embeddings_after_pixel_shuffle, llm_hidden_size].
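The first dimension of the returned tensor is the sum of per-image embedding counts. A minimal sketch of that count follows, assuming each image's height and width are multiples of `patch_size` and that the patch grid is divisible by `1 / downsample_ratio`. `embeddings_per_image` is a hypothetical helper, and the sizes are made-up examples.

```python
def embeddings_per_image(imgs_sizes, patch_size=16, downsample_ratio=0.5):
    """Count embeddings per image after patching and pixel shuffle (assumed arithmetic)."""
    counts = []
    for h, w in imgs_sizes:
        h_patches, w_patches = h // patch_size, w // patch_size
        counts.append(int(h_patches * downsample_ratio) * int(w_patches * downsample_ratio))
    return counts


sizes = [(512, 512), (256, 384)]   # real (h, w) per image, before padding
counts = embeddings_per_image(sizes)
print(counts, sum(counts))  # [256, 96] 352
```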

_pixel_shuffle_dynamic_res(x: torch.Tensor, imgs_sizes: list[tuple[int, int]]) → torch.Tensor#

Per-image pixel-shuffle for dynamic-resolution outputs.

Ported from vLLM’s NanoNemotronVLMultimodal.pixel_shuffle_dynamic_res. Splits x along the sequence dim by per-image patch counts, reshapes each split to (N, H_patches, W_patches, C_feat), applies pixel_shuffle with downsample_ratio, and flattens back to a concatenated (N, L’, C).

extract_video_feature(pixel_values_videos: torch.Tensor) → torch.Tensor#

Pack T = video_temporal_patch_dim frames into channels and run the ViT.

Returns embeddings shaped like extract_feature output, but with ceil(N_frames / T) rows instead of one row per frame.
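The row count can be illustrated by grouping frame indices into chunks of T. This sketch only shows the grouping arithmetic; how a final, shorter group is padded before being packed into channels is not specified here, so it is left as-is. `group_frames` is a hypothetical helper.

```python
import math


def group_frames(num_frames: int, temporal_patch_dim: int) -> list[list[int]]:
    """Split frame indices into consecutive groups of at most temporal_patch_dim frames."""
    indices = list(range(num_frames))
    return [
        indices[i:i + temporal_patch_dim]
        for i in range(0, num_frames, temporal_patch_dim)
    ]


groups = group_frames(num_frames=7, temporal_patch_dim=2)
print(groups)       # [[0, 1], [2, 3], [4, 5], [6]]
print(len(groups))  # 4, i.e. ceil(7 / 2)
```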

extract_sound_feature(input_features: torch.Tensor, attention_mask: Optional[torch.Tensor] = None) → torch.Tensor#

Extract and project sound features from audio input.

Parameters:

input_features – Mel spectrogram features [batch, seq_len, feature_dim]
attention_mask – Optional attention mask [batch, seq_len]

Returns:

Sound embeddings projected to LLM hidden size

prepare_model_inputs_for_cp(input_ids: torch.Tensor, pixel_values: Optional[torch.Tensor] = None, image_flags: Optional[torch.Tensor] = None, imgs_sizes: Optional[torch.Tensor] = None, pixel_values_videos: Optional[torch.Tensor] = None, sound_features: Optional[torch.Tensor] = None, sound_attention_mask: Optional[torch.Tensor] = None) → dict#

Merge image/video/audio features into text embeddings BEFORE CP sharding.

Under CP > 1 the sequence is sharded; multimodal scatter must run on the full un-sharded sequence so each rank ends up with embeddings that match its local slice of input_ids. Returns a dict so future per-layer inputs can ride alongside inputs_embeds.
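A toy example shows why the scatter has to happen before sharding. The sketch uses strings in place of embedding vectors and assumes a simple contiguous split across ranks; the real context-parallel sharding scheme may differ. `shard` is a hypothetical helper.

```python
IMG_TOKEN_ID = 18  # img_context_token_id default from NemotronOmniConfig


def shard(seq, cp_size, rank):
    """Return the contiguous slice of seq owned by rank (assumes an even split)."""
    chunk = len(seq) // cp_size
    return seq[rank * chunk:(rank + 1) * chunk]


input_ids = [1, 18, 18, 18, 5, 6, 18, 2]
vision_embeds = iter(["v0", "v1", "v2", "v3"])

# Scatter on the FULL sequence first: each image token takes the next vision embedding.
merged = [next(vision_embeds) if tok == IMG_TOKEN_ID else f"t{tok}" for tok in input_ids]

for rank in range(2):
    print(rank, shard(input_ids, 2, rank), shard(merged, 2, rank))
```

Rank 1 holds a single image token, yet it must receive `v3`, the fourth vision embedding. That offset is only known when the scatter runs over the whole sequence.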

prepare_inputs_embeds_for_cp(input_ids: torch.Tensor, pixel_values: Optional[torch.Tensor] = None, image_flags: Optional[torch.Tensor] = None, imgs_sizes: Optional[torch.Tensor] = None, pixel_values_videos: Optional[torch.Tensor] = None, sound_features: Optional[torch.Tensor] = None, sound_attention_mask: Optional[torch.Tensor] = None) → torch.Tensor#

Thin wrapper returning just inputs_embeds for callers that don’t need the full prepared-inputs dict.

forward(pixel_values: Optional[torch.FloatTensor] = None, input_ids: Optional[torch.LongTensor] = None, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, image_flags: Optional[torch.LongTensor] = None, imgs_sizes: Optional[torch.LongTensor] = None, past_key_values: Optional[List[torch.FloatTensor]] = None, labels: Optional[torch.LongTensor] = None, sound_features: Optional[torch.FloatTensor] = None, sound_attention_mask: Optional[torch.Tensor] = None, pixel_values_videos: Optional[torch.FloatTensor] = None, inputs_embeds: Optional[torch.FloatTensor] = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, *, _pre_embed_only: bool = False, **kwargs) → Union[dict, Tuple, transformers.modeling_outputs.CausalLMOutputWithPast]#

Forward pass for training.

This follows the same pattern as the HF NemotronH_Nano_Omni_Reasoning_V3.forward():

Get text embeddings from LLM embed_tokens
Extract vision features from pixel_values
Replace image token embeddings with vision embeddings
Run LLM forward pass
Compute loss if labels provided

Parameters:

pixel_values – Image pixel values [num_tiles, C, H, W]
input_ids – Input token IDs [batch, seq_len]
attention_mask – Attention mask [batch, seq_len]
position_ids – Position IDs (unused, for API compat)
image_flags – Flags indicating real images vs padding [num_tiles, 1]
labels – Token IDs for loss computation [batch, seq_len]
inputs_embeds – Pre-computed input embeddings (optional)
use_cache – Whether to use caching (not used in training)
**kwargs – Additional arguments

Returns:

CausalLMOutputWithPast with loss and logits
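The replacement of image token embeddings with vision embeddings, together with the role of image_flags, can be sketched in plain Python. The sketch assumes a flag of 1 marks a real tile and 0 marks padding, and that padded tiles are dropped before the scatter; both are assumptions, not confirmed by this page. Strings stand in for embedding vectors, and `merge_vision_embeddings` is a hypothetical helper.

```python
def merge_vision_embeddings(input_ids, text_embeds, tile_embeds, image_flags,
                            img_token_id=18):
    """Overwrite image-token positions with embeddings from tiles flagged as real."""
    # Assumption: flag == 1 marks a real tile, flag == 0 marks padding.
    real_tokens = [
        tok for tile, flag in zip(tile_embeds, image_flags) if flag == 1 for tok in tile
    ]
    positions = [i for i, t in enumerate(input_ids) if t == img_token_id]
    if len(positions) != len(real_tokens):
        raise ValueError(
            f"{len(positions)} image tokens but {len(real_tokens)} vision embeddings"
        )
    merged = list(text_embeds)
    for pos, emb in zip(positions, real_tokens):
        merged[pos] = emb
    return merged


merged = merge_vision_embeddings(
    input_ids=[1, 18, 18, 7],
    text_embeds=["t1", "t18", "t18", "t7"],
    tile_embeds=[["a0", "a1"], ["pad0", "pad1"]],  # second tile is padding
    image_flags=[1, 0],
)
print(merged)  # ['t1', 'a0', 'a1', 't7']
```

The length check catches a mismatch between the number of image placeholder tokens in the prompt and the number of vision embeddings before any position is overwritten.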

initialize_weights(buffer_device: torch.device | None = None, dtype: torch.dtype = torch.bfloat16) → None#

Initialize model weights.

Parameters:

buffer_device – Device to use for buffer initialization
dtype – Target dtype for model weights

nemo_automodel.components.models.nemotron_omni.model.ModelClass#: None
