nemo_automodel.components.models.nemotron_omni.model#

NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) custom model for Nemo Automodel.

This model is a VLM (vision-language model) with:

  • Vision encoder: RADIO v2.5-H (ViT-Huge, patch_size=16) – loaded from HF

  • Audio encoder: Parakeet (FastConformer-based) – loaded from HF

  • LLM: NemotronH (hybrid Mamba+Attention MoE) – reuses nemotron_v3 custom implementation

  • Projectors: MLP projectors for vision->LLM and audio->LLM

Architecture name: “NemotronH_Nano_Omni_Reasoning_V3” (from config.json)

Module Contents#

Classes#

SquaredReLU

Squared ReLU activation: ReLU(x)^2.

RMSNorm

Root Mean Square Layer Normalization.

VisionProjector

MLP projector from vision encoder to LLM hidden space.

SoundProjection

MLP projector from sound encoder to LLM hidden space.

NemotronOmniConfig

Configuration for the NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) model.

_ModelProxy

Thin proxy so the MoE parallelizer can navigate model.model.moe_config and model.model -> get_text_module -> .layers without changing the weight hierarchy.

NemotronOmniForConditionalGeneration

NemotronOmni VLM model for conditional generation (training).

Data#

API#

nemo_automodel.components.models.nemotron_omni.model.logger#

‘getLogger(…)’

class nemo_automodel.components.models.nemotron_omni.model.SquaredReLU#

Bases: torch.nn.Module

Squared ReLU activation: ReLU(x)^2.
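
A minimal sketch of the activation described above, written as a stand-alone module. The class name `SquaredReLUSketch` is illustrative and is not part of the package.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SquaredReLUSketch(nn.Module):
    """Illustrative stand-in: computes ReLU(x)^2 elementwise."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Negative inputs are clamped to zero, positive inputs are squared.
        return F.relu(x).square()


act = SquaredReLUSketch()
out = act(torch.tensor([-2.0, 0.0, 3.0]))
print(out)  # tensor([0., 0., 9.])
```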

forward(x: torch.Tensor) -> torch.Tensor#

class nemo_automodel.components.models.nemotron_omni.model.RMSNorm(hidden_size: int, eps: float = 1e-05)#

Bases: torch.nn.Module

Root Mean Square Layer Normalization.
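
For reference, RMS normalization computes y = weight * x / sqrt(mean(x^2) + eps) over the last dimension. The sketch below follows that formula; upcasting to float32 before the reduction is an assumption borrowed from common implementations, and `RMSNormSketch` is an illustrative name rather than the class documented here.

```python
import torch
import torch.nn as nn


class RMSNormSketch(nn.Module):
    """Illustrative RMSNorm: scale by the reciprocal root-mean-square."""

    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        in_dtype = hidden_states.dtype
        x = hidden_states.float()  # assumed: reduce in float32 for stability
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return (self.weight * x).to(in_dtype)


norm = RMSNormSketch(hidden_size=4)
y = norm(torch.tensor([[1.0, 2.0, 3.0, 4.0]]))
```

With the default unit weight, the mean of the squared outputs is close to 1.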

Initialization

forward(hidden_states: torch.Tensor) -> torch.Tensor#

class nemo_automodel.components.models.nemotron_omni.model.VisionProjector(
vit_hidden_size: int,
projector_hidden_size: int,
llm_hidden_size: int,
downsample_ratio: float = 0.5,
)#

Bases: torch.nn.Module

MLP projector from vision encoder to LLM hidden space.

HF checkpoint structure (mlp1):

  • mlp1.0.weight -> RMSNorm weight (vit_hidden_size * pixel_shuffle_factor^2,)

  • mlp1.1.weight -> Linear1 weight (projector_hidden_size, vit_hidden_size * pixel_shuffle_factor^2)

  • mlp1.3.weight -> Linear2 weight (llm_hidden_size, projector_hidden_size)

Between linear1 and linear2 there is a SquaredReLU activation (index 2 in Sequential, but it has no weight).
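
The index layout above can be reproduced with a plain `nn.Sequential`. This is a sketch under assumptions: the linears are taken to be bias-free because only weights are listed, the input width is computed as vit_hidden_size / downsample_ratio^2, and the helper classes are minimal stand-ins rather than the package's own.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):  # minimal stand-in
    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


class SquaredReLU(nn.Module):  # minimal stand-in, holds no parameters
    def forward(self, x):
        return torch.relu(x).square()


def build_mlp1_sketch(vit_hidden_size, projector_hidden_size, llm_hidden_size,
                      downsample_ratio=0.5):
    # Pixel shuffle multiplies the channel count by 1 / ratio^2 (4x for 0.5).
    in_features = int(vit_hidden_size / downsample_ratio**2)
    return nn.Sequential(
        RMSNorm(in_features),                                          # index 0
        nn.Linear(in_features, projector_hidden_size, bias=False),     # index 1
        SquaredReLU(),                                                 # index 2
        nn.Linear(projector_hidden_size, llm_hidden_size, bias=False), # index 3
    )


mlp1 = build_mlp1_sketch(vit_hidden_size=8, projector_hidden_size=16, llm_hidden_size=4)
keys = list(mlp1.state_dict().keys())
out = mlp1(torch.randn(2, 5, 32))
```

Index 2 contributes no entry to the state dict, which is why the checkpoint keys skip from mlp1.1 to mlp1.3.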

Initialization

forward(x: torch.Tensor) -> torch.Tensor#

class nemo_automodel.components.models.nemotron_omni.model.SoundProjection(
sound_hidden_size: int,
projection_hidden_size: int,
llm_hidden_size: int,
bias: bool = False,
)#

Bases: torch.nn.Module

MLP projector from sound encoder to LLM hidden space.

HF checkpoint structure:

  • sound_projection.norm.weight -> RMSNorm weight (sound_hidden_size,)

  • sound_projection.linear1.weight -> Linear1 weight (projection_hidden_size, sound_hidden_size)

  • sound_projection.linear2.weight -> Linear2 weight (llm_hidden_size, projection_hidden_size)
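
Unlike the vision projector, the sound projector uses named submodules, so its keys carry names instead of Sequential indices. The sketch below mirrors the listed keys and the RMSNorm -> Linear -> SquaredReLU -> Linear order stated for the wrapping model; the class name and the inline RMSNorm are illustrative stand-ins.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):  # minimal stand-in
    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


class SoundProjectionSketch(nn.Module):
    def __init__(self, sound_hidden_size, projection_hidden_size,
                 llm_hidden_size, bias=False):
        super().__init__()
        # Attribute names match the checkpoint key suffixes listed above.
        self.norm = RMSNorm(sound_hidden_size)
        self.linear1 = nn.Linear(sound_hidden_size, projection_hidden_size, bias=bias)
        self.linear2 = nn.Linear(projection_hidden_size, llm_hidden_size, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = torch.relu(self.linear1(self.norm(x))).square()  # SquaredReLU
        return self.linear2(hidden)


proj = SoundProjectionSketch(sound_hidden_size=6, projection_hidden_size=12, llm_hidden_size=4)
keys = sorted(proj.state_dict().keys())
out = proj(torch.randn(2, 7, 6))
```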

Initialization

forward(x: torch.Tensor) -> torch.Tensor#

class nemo_automodel.components.models.nemotron_omni.model.NemotronOmniConfig(
vision_config=None,
llm_config=None,
sound_config=None,
force_image_size=512,
downsample_ratio=0.5,
patch_size=16,
template=None,
ps_version='v2',
image_tag_type='internvl',
projector_hidden_size=20480,
vit_hidden_size=1280,
img_context_token_id=18,
video_context_token_id=131081,
sound_context_token_id=27,
video_pruning_rate=0.7,
**kwargs,
)#

Bases: transformers.configuration_utils.PretrainedConfig

Configuration for the NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) model.

This wraps the HF config and provides easy access to sub-configs.

Initialization

model_type#

‘NemotronH_Nano_Omni_Reasoning_V3’

is_composition#

True

class nemo_automodel.components.models.nemotron_omni.model._ModelProxy(
llm: nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM,
)#

Thin proxy so the MoE parallelizer can navigate model.model.moe_config and model.model -> get_text_module -> .layers without changing the weight hierarchy.

The parallelizer (parallelizer.py) expects:

  • model.model.moe_config (for expert-count validation)

  • model.model -> get_text_module() (finds language_model attr) -> .layers

By setting self.model = _ModelProxy(self.language_model) on the VLM:

  • model.model.moe_config -> language_model.model.moe_config (OK)

  • get_text_module(model.model) -> model.model.language_model == language_model.model (NemotronV3Model) -> .layers (OK)
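
The attribute navigation can be illustrated with plain Python objects. Everything below is a sketch: `ModelProxySketch`, the toy `get_text_module`, and the `num_experts` key are hypothetical stand-ins, and the real proxy may be implemented differently.

```python
from types import SimpleNamespace


class ModelProxySketch:
    """Plain-Python view over an LLM wrapper; it owns no parameters itself."""

    def __init__(self, llm):
        # llm.model is the inner decoder that holds .layers and .moe_config.
        self.language_model = llm.model

    @property
    def moe_config(self):
        return self.language_model.moe_config


def get_text_module(module):
    # Hypothetical stand-in for the parallelizer helper:
    # prefer a `language_model` attribute when one exists.
    return getattr(module, "language_model", module)


# Toy objects in place of the real LLM; "num_experts" is a placeholder key.
inner = SimpleNamespace(moe_config={"num_experts": 8}, layers=["layer0", "layer1"])
llm = SimpleNamespace(model=inner)
vlm = SimpleNamespace(language_model=llm, model=ModelProxySketch(llm))

moe_config = vlm.model.moe_config            # resolves to llm.model.moe_config
layers = get_text_module(vlm.model).layers   # resolves to llm.model.layers
```

Because the proxy only forwards attribute access, parameters stay registered under language_model and checkpoint keys are unaffected.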

Initialization

class nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin

NemotronOmni VLM model for conditional generation (training).

Wraps:

  • Vision encoder (RADIO v2.5-H) – HF implementation via trust_remote_code

  • Audio encoder (Parakeet) – HF implementation via trust_remote_code

  • Vision projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)

  • Sound projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)

  • Language model (NemotronH hybrid Mamba+Attention MoE) – nemotron_v3 custom impl

The LLM part reuses the nemotron_v3 implementation (NemotronHForCausalLM) which has custom DTensor parallelism for the Mamba+Attention hybrid MoE architecture.

Initialization

Initialize NemotronOmniForConditionalGeneration.

Parameters:
  • config – NemotronH_Nano_Omni_Reasoning_V3 config

  • backend – Backend configuration

  • **kwargs – Additional arguments

classmethod from_config(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

Create model from config.

Parameters:
  • config – NemotronH_Nano_Omni_Reasoning_V3 config (HF config with trust_remote_code)

  • backend – Backend configuration

  • **kwargs – Additional arguments

Returns:

NemotronOmniForConditionalGeneration instance

classmethod from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
**kwargs,
)#

Load pretrained model.

Parameters:
  • pretrained_model_name_or_path – Path or name of pretrained model

  • *model_args – Additional positional arguments

  • **kwargs – Additional keyword arguments

Returns:

NemotronOmniForConditionalGeneration instance

static _make_missing_buffers_non_persistent(module: torch.nn.Module) -> None#

Convert persistent buffers that are NOT saved in HF checkpoints to non-persistent buffers.

The RADIO vision encoder registers some buffers (e.g. summary_idxs) as persistent, but the HF checkpoint does not contain them. When the DCP loader builds its load plan, it expects every persistent buffer to appear in the checkpoint and raises RuntimeError: Missing key otherwise.

This method re-registers such buffers as non-persistent so they are kept at their init-time values and not expected on disk.
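
A sketch of the re-registration technique using only public `torch.nn.Module` APIs. How the real method decides which buffers are missing is not documented here, so the `missing_names` argument and the function name are hypothetical.

```python
import torch
import torch.nn as nn


def make_buffers_non_persistent(module: nn.Module, missing_names: set) -> None:
    """Re-register selected buffers with persistent=False, keeping their values."""
    for sub in module.modules():
        # Copy to a list first because the buffer dict is mutated in the loop.
        for name, buf in list(sub.named_buffers(recurse=False)):
            if name in missing_names:
                delattr(sub, name)  # drop the persistent registration
                sub.register_buffer(name, buf, persistent=False)


class ToyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("summary_idxs", torch.tensor([0, 1]))
        self.register_buffer("scale", torch.ones(1))


enc = ToyEncoder()
make_buffers_non_persistent(enc, {"summary_idxs"})
state_keys = set(enc.state_dict().keys())
```

After the call the buffer is still available as an attribute and still moves with `.to(...)`, but it no longer appears in `state_dict()`, so a loader that enumerates the state dict does not look for it on disk.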

get_input_embeddings()#

Return the input embeddings from the language model.

set_input_embeddings(value)#

Set the input embeddings of the language model.

get_output_embeddings()#

Return the output embeddings (lm_head) from the language model.

set_output_embeddings(new_embeddings)#

Set the output embeddings (lm_head) of the language model.

pixel_shuffle(
x: torch.Tensor,
scale_factor: float = 0.5,
) -> torch.Tensor#

Pixel shuffle for downsampling spatial resolution while increasing channels.

Parameters:
  • x – Input tensor [N, W, H, C]

  • scale_factor – Downsampling ratio (default 0.5 = halve spatial dims)

Returns:

Shuffled tensor [N, W*scale, H*scale, C/(scale^2)]
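
The shape contract above can be checked with a small stand-alone function. This sketch is modeled on the InternVL-style pixel shuffle (the 'v2' ordering); whether the method here uses exactly these view/permute steps is an assumption.

```python
import torch


def pixel_shuffle_sketch(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    """[N, W, H, C] -> [N, W*scale, H*scale, C/(scale^2)]."""
    n, w, h, c = x.size()
    # Fold part of H into the channel dim: [N, W, H*scale, C/scale]
    x = x.view(n, w, int(h * scale_factor), int(c / scale_factor))
    x = x.permute(0, 2, 1, 3).contiguous()  # [N, H*scale, W, C/scale]
    # Fold part of W into the channel dim: [N, H*scale, W*scale, C/scale^2]
    x = x.view(n, int(h * scale_factor), int(w * scale_factor),
               int(c / (scale_factor * scale_factor)))
    return x.permute(0, 2, 1, 3).contiguous()  # [N, W*scale, H*scale, C/scale^2]


x = torch.randn(2, 4, 6, 8)
y = pixel_shuffle_sketch(x, scale_factor=0.5)
```

With scale_factor=0.5 each spatial dimension is halved and the channel count grows by 4x, so the total number of elements is unchanged.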

extract_feature(pixel_values: torch.Tensor) -> torch.Tensor#

Extract vision features from pixel values through RADIO + projector.

Parameters:
  • pixel_values – Image tensors [num_tiles, C, H, W]

Returns:

Vision embeddings [num_tiles, num_tokens, llm_hidden_size]
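
As a worked example of the shapes involved, the arithmetic below plugs in the defaults from the NemotronOmniConfig signature (force_image_size=512, patch_size=16, downsample_ratio=0.5, vit_hidden_size=1280). It assumes square tiles at force_image_size and that only patch tokens reach the pixel shuffle; the helper names are illustrative.

```python
def tokens_per_tile(image_size: int = 512, patch_size: int = 16,
                    downsample_ratio: float = 0.5) -> int:
    """Number of vision tokens per tile after pixel shuffle."""
    patches_per_side = image_size // patch_size  # 512 // 16 = 32
    return int(patches_per_side**2 * downsample_ratio**2)


def projector_in_features(vit_hidden_size: int = 1280,
                          downsample_ratio: float = 0.5) -> int:
    """Channel width fed to the projector after pixel shuffle."""
    return int(vit_hidden_size / downsample_ratio**2)


num_tokens = tokens_per_tile()         # 32 * 32 * 0.25 = 256
in_features = projector_in_features()  # 1280 * 4 = 5120
```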

extract_feature_dynamic(
pixel_values: torch.Tensor,
imgs_sizes: torch.Tensor | list[tuple[int, int]],
) -> torch.Tensor#

Dynamic-resolution feature extraction (no tile splitting).

Matches vLLM’s dynamic-resolution vision path for Nano v3 VL / Nemotron-Omni (see 3rdparty/vllm/vllm/model_executor/models/nano_nemotron_vl.py). Required when the rollout uses DynamicResolutionImageTiler: tile-based extract_feature would produce different embeddings and break rollout/train logprob agreement.

Unlike vLLM’s RADIO port (which supports packed imgs_sizes= inputs), the HF RADIO from nvidia/C-RADIOv2-H only accepts a dense (B, C, H, W) tensor. We crop each padded image back to its real size and run the vision model per-image, then concatenate features.

Parameters:
  • pixel_values – [num_images, C, H_padded, W_padded] batch of dynamically-resized images padded to the batch max (h, w).

  • imgs_sizes – [num_images, 2] actual (h, w) per image (torch tensor of ints) or an equivalent list of tuples.

Returns:

Vision embeddings [sum_num_embeddings_after_pixel_shuffle, llm_hidden_size].
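
The crop-and-encode loop described above might look like the following sketch. It assumes padding is on the bottom and right (so a top-left crop recovers the real image), and `toy_encoder` is a hypothetical stand-in for the vision model that emits one feature per 16x16 patch.

```python
import torch
import torch.nn.functional as F


def toy_encoder(x: torch.Tensor, patch: int = 16) -> torch.Tensor:
    # Hypothetical encoder: mean-pool each patch -> (1, num_patches, C)
    pooled = F.avg_pool2d(x, patch)           # (1, C, h/patch, w/patch)
    return pooled.flatten(2).transpose(1, 2)  # (1, L, C)


def encode_dynamic_sketch(pixel_values, imgs_sizes, encoder=toy_encoder):
    sizes = imgs_sizes.tolist() if torch.is_tensor(imgs_sizes) else imgs_sizes
    feats = []
    for img, (h, w) in zip(pixel_values, sizes):
        cropped = img[:, :h, :w].unsqueeze(0)  # undo padding: (1, C, h, w)
        feats.append(encoder(cropped))         # (1, L_i, C_feat)
    return torch.cat(feats, dim=1)             # (1, sum(L_i), C_feat)


pixel_values = torch.randn(2, 3, 64, 64)       # padded to the batch max (h, w)
imgs_sizes = torch.tensor([[64, 32], [32, 32]])
features = encode_dynamic_sketch(pixel_values, imgs_sizes)
```

Here the two images contribute 4*2 = 8 and 2*2 = 4 patches, so the concatenated sequence has 12 entries and none of them come from padding.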

_pixel_shuffle_dynamic_res(
x: torch.Tensor,
imgs_sizes: list[tuple[int, int]],
) -> torch.Tensor#

Per-image pixel-shuffle for dynamic-resolution outputs.

Ported from vLLM’s NanoNemotronVLMultimodal.pixel_shuffle_dynamic_res. Splits x along the sequence dim by per-image patch counts, reshapes each split to (N, H_patches, W_patches, C_feat), applies pixel_shuffle with downsample_ratio, and flattens back to a concatenated (N, L’, C).
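
The split, reshape, shuffle, and concatenate steps can be sketched as below. This is not the ported vLLM code: the patch size of 16, the (h, w) pixel convention for imgs_sizes, and the helper names are assumptions made for illustration.

```python
import torch


def pixel_shuffle(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    # Same view/permute pattern as an InternVL-style pixel shuffle.
    n, a, b, c = x.size()
    x = x.view(n, a, int(b * scale), int(c / scale)).permute(0, 2, 1, 3).contiguous()
    x = x.view(n, int(b * scale), int(a * scale), int(c / (scale * scale)))
    return x.permute(0, 2, 1, 3).contiguous()


def pixel_shuffle_dynamic_sketch(x, imgs_sizes, patch_size=16, scale=0.5):
    """x: (1, sum(L_i), C) with L_i = (h_i // patch) * (w_i // patch)."""
    grids = [(h // patch_size, w // patch_size) for h, w in imgs_sizes]
    counts = [gh * gw for gh, gw in grids]
    outputs = []
    for chunk, (gh, gw) in zip(torch.split(x, counts, dim=1), grids):
        grid = chunk.reshape(1, gh, gw, chunk.shape[-1])  # (1, H_p, W_p, C)
        shuffled = pixel_shuffle(grid, scale)             # 4x channels at 0.5
        outputs.append(shuffled.reshape(1, -1, shuffled.shape[-1]))
    return torch.cat(outputs, dim=1)                      # (1, L', C')


x = torch.randn(1, 12, 8)  # 8 patches + 4 patches, 8 features each
out = pixel_shuffle_dynamic_sketch(x, [(64, 32), (32, 32)])
```

Each image shrinks by a factor of 4 in sequence length (8 -> 2 and 4 -> 1) while the feature width grows from 8 to 32.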

extract_video_feature(
pixel_values_videos: torch.Tensor,
) -> torch.Tensor#

Pack T = video_temporal_patch_dim frames into channels and run the ViT.

Returns embeddings shaped like extract_feature output, but with ceil(N_frames / T) rows instead of one row per frame.
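
The frame packing and the ceil(N_frames / T) row count can be illustrated as follows. How the real method pads a trailing partial group is not stated here; repeating the last frame is an assumption of this sketch, as is the function name.

```python
import math

import torch


def pack_frames_sketch(frames: torch.Tensor, t: int) -> torch.Tensor:
    """(N, C, H, W) -> (ceil(N / t), t * C, H, W) by stacking t frames in channels."""
    n = frames.shape[0]
    pad = (-n) % t
    if pad:
        # Assumed padding strategy: repeat the final frame.
        frames = torch.cat([frames, frames[-1:].expand(pad, -1, -1, -1)], dim=0)
    _, c, h, w = frames.shape
    return frames.reshape(-1, t * c, h, w)


frames = torch.randn(5, 3, 8, 8)
packed = pack_frames_sketch(frames, t=2)
expected_rows = math.ceil(5 / 2)
```

Five frames with T = 2 give three packed rows of 6 channels each, and the first row holds frames 0 and 1 back to back.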

extract_sound_feature(
input_features: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
) -> torch.Tensor#

Extract and project sound features from audio input.

Parameters:
  • input_features – Mel spectrogram features [batch, seq_len, feature_dim]

  • attention_mask – Optional attention mask [batch, seq_len]

Returns:

Sound embeddings projected to LLM hidden size

prepare_model_inputs_for_cp(
input_ids: torch.Tensor,
pixel_values: Optional[torch.Tensor] = None,
image_flags: Optional[torch.Tensor] = None,
imgs_sizes: Optional[torch.Tensor] = None,
pixel_values_videos: Optional[torch.Tensor] = None,
sound_features: Optional[torch.Tensor] = None,
sound_attention_mask: Optional[torch.Tensor] = None,
) -> dict#

Merge image/video/audio features into text embeddings BEFORE CP sharding.

Under CP > 1 the sequence is sharded; multimodal scatter must run on the full un-sharded sequence so each rank ends up with embeddings that match its local slice of input_ids. Returns a dict so future per-layer inputs can ride alongside inputs_embeds.
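
The ordering constraint (scatter first, shard second) can be shown on a toy sequence. The sketch uses the default img_context_token_id of 18 from the config signature and splits the sequence into contiguous chunks; real context-parallel sharding may use a different layout, and the function name is illustrative.

```python
import torch

IMG_CONTEXT_TOKEN_ID = 18  # default img_context_token_id in NemotronOmniConfig


def merge_then_shard_sketch(input_ids, text_embeds, vision_embeds, cp_size):
    merged = text_embeds.clone()
    mask = input_ids == IMG_CONTEXT_TOKEN_ID       # (B, S) placeholder positions
    merged[mask] = vision_embeds.to(merged.dtype)  # scatter on the FULL sequence
    # Only now split along the sequence dim, one shard per CP rank.
    return list(torch.chunk(merged, cp_size, dim=1))


input_ids = torch.tensor([[1, 18, 18, 2]])
text_embeds = torch.zeros(1, 4, 2)
vision_embeds = torch.ones(2, 2)  # one row per image placeholder token
shards = merge_then_shard_sketch(input_ids, text_embeds, vision_embeds, cp_size=2)
```

The two image placeholders straddle the shard boundary, yet each shard holds the correct embedding for its slice. Scattering after sharding would require every rank to work out which rows of vision_embeds belong to it.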

prepare_inputs_embeds_for_cp(
input_ids: torch.Tensor,
pixel_values: Optional[torch.Tensor] = None,
image_flags: Optional[torch.Tensor] = None,
imgs_sizes: Optional[torch.Tensor] = None,
pixel_values_videos: Optional[torch.Tensor] = None,
sound_features: Optional[torch.Tensor] = None,
sound_attention_mask: Optional[torch.Tensor] = None,
) -> torch.Tensor#

Thin wrapper returning just inputs_embeds for callers that don’t need the full prepared-inputs dict.

forward(
pixel_values: Optional[torch.FloatTensor] = None,
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
image_flags: Optional[torch.LongTensor] = None,
imgs_sizes: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
labels: Optional[torch.LongTensor] = None,
sound_features: Optional[torch.FloatTensor] = None,
sound_attention_mask: Optional[torch.Tensor] = None,
pixel_values_videos: Optional[torch.FloatTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
*,
_pre_embed_only: bool = False,
**kwargs,
) -> Union[dict, Tuple, transformers.modeling_outputs.CausalLMOutputWithPast]#

Forward pass for training.

This follows the same pattern as the HF NemotronH_Nano_Omni_Reasoning_V3.forward():

  1. Get text embeddings from LLM embed_tokens

  2. Extract vision features from pixel_values

  3. Replace image token embeddings with vision embeddings

  4. Run LLM forward pass

  5. Compute loss if labels provided
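
The five steps can be traced end to end on toy modules. Nothing below is the real model: the embedding and linear layer stand in for the LLM, the vision features are random, and the shifted cross-entropy with ignore_index=-100 is the usual Hugging Face convention, assumed here rather than confirmed by this page.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, hidden, img_token_id = 32, 8, 18

embed_tokens = nn.Embedding(vocab_size, hidden)  # toy stand-in for LLM embeddings
toy_llm = nn.Linear(hidden, vocab_size)          # toy stand-in for LLM + lm_head

input_ids = torch.tensor([[1, 18, 18, 5, 6]])
labels = torch.tensor([[-100, -100, -100, 5, 6]])

# 1. Text embeddings from the LLM embedding table.
inputs_embeds = embed_tokens(input_ids).clone()
# 2. Vision features (random here; the model would use RADIO + projector).
vision_embeds = torch.randn(2, hidden)
# 3. Overwrite the image placeholder positions with vision embeddings.
inputs_embeds[input_ids == img_token_id] = vision_embeds
# 4. LLM forward pass.
logits = toy_llm(inputs_embeds)
# 5. Next-token loss over positions whose label is not -100.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
```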

Parameters:
  • pixel_values – Image pixel values [num_tiles, C, H, W]

  • input_ids – Input token IDs [batch, seq_len]

  • attention_mask – Attention mask [batch, seq_len]

  • position_ids – Position IDs (unused, for API compat)

  • image_flags – Flags indicating real images vs padding [num_tiles, 1]

  • labels – Token IDs for loss computation [batch, seq_len]

  • inputs_embeds – Pre-computed input embeddings (optional)

  • use_cache – Whether to use caching (not used in training)

  • **kwargs – Additional arguments

Returns:

CausalLMOutputWithPast with loss and logits

initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16,
) -> None#

Initialize model weights.

Parameters:
  • buffer_device – Device to use for buffer initialization

  • dtype – Target dtype for model weights

nemo_automodel.components.models.nemotron_omni.model.ModelClass#

None