nemo_automodel.components.models.nemotron_omni.model#
NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) custom model for Nemo Automodel.
This model is a VLM (vision-language model) with:
Vision encoder: RADIO v2.5-H (ViT-Huge, patch_size=16) – loaded from HF
Audio encoder: Parakeet (FastConformer-based) – loaded from HF
LLM: NemotronH (hybrid Mamba+Attention MoE) – reuses nemotron_v3 custom implementation
Projectors: MLP projectors for vision->LLM and audio->LLM
Architecture name: “NemotronH_Nano_Omni_Reasoning_V3” (from config.json)
Module Contents#
Classes#
| Class | Description |
|---|---|
| SquaredReLU | Squared ReLU activation: ReLU(x)^2. |
| RMSNorm | Root Mean Square Layer Normalization. |
| VisionProjector | MLP projector from vision encoder to LLM hidden space. |
| SoundProjection | MLP projector from sound encoder to LLM hidden space. |
| NemotronOmniConfig | Configuration for the NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) model. |
| _ModelProxy | Thin proxy so the MoE parallelizer can navigate model.model.moe_config and model.model -> get_text_module -> .layers without changing the weight hierarchy. |
| NemotronOmniForConditionalGeneration | NemotronOmni VLM model for conditional generation (training). |
Data#
API#
- nemo_automodel.components.models.nemotron_omni.model.logger#
‘getLogger(…)’
- class nemo_automodel.components.models.nemotron_omni.model.SquaredReLU#
Bases: torch.nn.Module
Squared ReLU activation: ReLU(x)^2.
- forward(x: torch.Tensor) torch.Tensor#
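The activation is simple enough to state in plain Python. The following is a minimal, dependency-free sketch of the element-wise rule ReLU(x)^2; the helper name `squared_relu` is hypothetical, and the module itself operates on `torch.Tensor` inputs rather than floats.

```python
def squared_relu(x: float) -> float:
    """ReLU(x)^2: zero for negative inputs, x * x otherwise."""
    r = max(x, 0.0)
    return r * r


# Apply the rule element-wise to a few sample values.
values = [-2.0, 0.0, 0.5, 3.0]
activated = [squared_relu(v) for v in values]
print(activated)  # [0.0, 0.0, 0.25, 9.0]
```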
- class nemo_automodel.components.models.nemotron_omni.model.RMSNorm(hidden_size: int, eps: float = 1e-05)#
Bases: torch.nn.Module
Root Mean Square Layer Normalization.
Initialization
- forward(hidden_states: torch.Tensor) torch.Tensor#
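For reference, RMS normalization is commonly defined as scaling each vector by the inverse of its root mean square and then by a learned per-channel weight. The sketch below assumes that standard formulation; the source does not show the module's actual implementation (for example any dtype casting), and `rms_norm` is a hypothetical helper working on Python lists.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Standard RMSNorm on one vector: weight * x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


hidden = [3.0, -4.0]
# With unit weights and eps=0 the output has a mean square of 1.
out = rms_norm(hidden, weight=[1.0, 1.0], eps=0.0)
print(out)
```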
- class nemo_automodel.components.models.nemotron_omni.model.VisionProjector(
- vit_hidden_size: int,
- projector_hidden_size: int,
- llm_hidden_size: int,
- downsample_ratio: float = 0.5,
Bases: torch.nn.Module
MLP projector from vision encoder to LLM hidden space.
HF checkpoint structure (mlp1):
mlp1.0.weight -> RMSNorm weight (vit_hidden_size * pixel_shuffle_factor^2,)
mlp1.1.weight -> Linear1 weight (projector_hidden_size, vit_hidden_size * pixel_shuffle_factor^2)
mlp1.3.weight -> Linear2 weight (llm_hidden_size, projector_hidden_size)
Between Linear1 and Linear2 there is a SquaredReLU activation (index 2 in the Sequential, which has no weights).
Initialization
- forward(x: torch.Tensor) torch.Tensor#
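The checkpoint layout above can be checked with a little arithmetic. This sketch assumes that pixel_shuffle_factor equals 1 / downsample_ratio (2 for the default 0.5), which is consistent with the pixel_shuffle return shape documented on this page but is not stated explicitly. The helper name and the `llm_hidden_size=4096` value are hypothetical.

```python
def vision_projector_shapes(vit_hidden_size, projector_hidden_size,
                            llm_hidden_size, downsample_ratio=0.5):
    """Expected mlp1.* weight shapes for the vision projector."""
    # Assumption: the pixel-shuffle factor is the inverse of downsample_ratio.
    factor = int(round(1 / downsample_ratio))
    in_features = vit_hidden_size * factor ** 2
    return {
        "mlp1.0.weight": (in_features,),                        # RMSNorm
        "mlp1.1.weight": (projector_hidden_size, in_features),  # Linear1
        "mlp1.3.weight": (llm_hidden_size, projector_hidden_size),  # Linear2
    }


# vit_hidden_size and projector_hidden_size are the NemotronOmniConfig
# defaults; llm_hidden_size is a made-up value for illustration only.
shapes = vision_projector_shapes(1280, 20480, llm_hidden_size=4096)
for name, shape in shapes.items():
    print(name, shape)
```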
- class nemo_automodel.components.models.nemotron_omni.model.SoundProjection(
- sound_hidden_size: int,
- projection_hidden_size: int,
- llm_hidden_size: int,
- bias: bool = False,
Bases: torch.nn.Module
MLP projector from sound encoder to LLM hidden space.
HF checkpoint structure:
sound_projection.norm.weight -> RMSNorm weight (sound_hidden_size,)
sound_projection.linear1.weight -> Linear1 weight (projection_hidden_size, sound_hidden_size)
sound_projection.linear2.weight -> Linear2 weight (llm_hidden_size, projection_hidden_size)
Initialization
- forward(x: torch.Tensor) torch.Tensor#
- class nemo_automodel.components.models.nemotron_omni.model.NemotronOmniConfig(
- vision_config=None,
- llm_config=None,
- sound_config=None,
- force_image_size=512,
- downsample_ratio=0.5,
- patch_size=16,
- template=None,
- ps_version='v2',
- image_tag_type='internvl',
- projector_hidden_size=20480,
- vit_hidden_size=1280,
- img_context_token_id=18,
- video_context_token_id=131081,
- sound_context_token_id=27,
- video_pruning_rate=0.7,
- **kwargs,
Bases: transformers.configuration_utils.PretrainedConfig
Configuration for the NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) model.
This wraps the HF config and provides easy access to sub-configs.
Initialization
- model_type#
‘NemotronH_Nano_Omni_Reasoning_V3’
- is_composition#
True
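The image-related defaults interact in a way that is easy to verify by hand. The sketch below assumes the InternVL-style relation between tile size, patch size, and downsample ratio (the config uses image_tag_type='internvl', but the source does not state this formula). The helper `tokens_per_tile` is hypothetical and ignores any extra summary tokens the vision encoder may produce.

```python
def tokens_per_tile(force_image_size=512, patch_size=16, downsample_ratio=0.5):
    """Image tokens per tile after pixel shuffle, under the stated assumption."""
    patches_per_side = force_image_size // patch_size  # 512 // 16 = 32
    return int(patches_per_side ** 2 * downsample_ratio ** 2)


# With the NemotronOmniConfig defaults: 32 * 32 patches, reduced 4x.
print(tokens_per_tile())  # 256
```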
- class nemo_automodel.components.models.nemotron_omni.model._ModelProxy( )#
Thin proxy so the MoE parallelizer can navigate model.model.moe_config and model.model -> get_text_module -> .layers without changing the weight hierarchy.
The parallelizer (parallelizer.py) expects:
model.model.moe_config (for expert-count validation)
model.model -> get_text_module() (finds language_model attr) -> .layers
By setting self.model = _ModelProxy(self.language_model) on the VLM:
model.model.moe_config -> language_model.model.moe_config (OK)
get_text_module(model.model) -> model.model.language_model == language_model.model (NemotronV3Model) -> .layers (OK)
Initialization
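The attribute paths described above can be illustrated without PyTorch. This is a hedged sketch, not the actual _ModelProxy code: `ModelProxySketch`, the `SimpleNamespace` stand-ins, and the `num_experts` key are all hypothetical. The sketch is a plain object rather than an nn.Module, which is one way to avoid registering the language model a second time in the weight hierarchy.

```python
from types import SimpleNamespace


class ModelProxySketch:
    """Hypothetical stand-in for _ModelProxy; holds a reference, owns no weights."""

    def __init__(self, causal_lm):
        self._causal_lm = causal_lm

    @property
    def moe_config(self):
        # model.model.moe_config -> language_model.model.moe_config
        return self._causal_lm.model.moe_config

    @property
    def language_model(self):
        # model.model.language_model == language_model.model
        return self._causal_lm.model


# Fake objects mimicking the nesting: VLM -> causal LM -> inner model.
inner = SimpleNamespace(moe_config={"num_experts": 8}, layers=["layer0", "layer1"])
causal_lm = SimpleNamespace(model=inner)
vlm = SimpleNamespace(language_model=causal_lm, model=ModelProxySketch(causal_lm))

print(vlm.model.moe_config)
print(vlm.model.language_model.layers)
```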
- class nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration(
- config,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs,
Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin
NemotronOmni VLM model for conditional generation (training).
Wraps:
Vision encoder (RADIO v2.5-H) – HF implementation via trust_remote_code
Audio encoder (Parakeet) – HF implementation via trust_remote_code
Vision projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)
Sound projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)
Language model (NemotronH hybrid Mamba+Attention MoE) – nemotron_v3 custom impl
The LLM part reuses the nemotron_v3 implementation (NemotronHForCausalLM) which has custom DTensor parallelism for the Mamba+Attention hybrid MoE architecture.
Initialization
Initialize NemotronOmniForConditionalGeneration.
- Parameters:
config – NemotronH_Nano_Omni_Reasoning_V3 config
backend – Backend configuration
**kwargs – Additional arguments
- classmethod from_config(
- config,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs,
Create model from config.
- Parameters:
config – NemotronH_Nano_Omni_Reasoning_V3 config (HF config with trust_remote_code)
backend – Backend configuration
**kwargs – Additional arguments
- Returns:
NemotronOmniForConditionalGeneration instance
- classmethod from_pretrained(
- pretrained_model_name_or_path: str,
- *model_args,
- **kwargs,
Load pretrained model.
- Parameters:
pretrained_model_name_or_path – Path or name of pretrained model
*model_args – Additional positional arguments
**kwargs – Additional keyword arguments
- Returns:
NemotronOmniForConditionalGeneration instance
- static _make_missing_buffers_non_persistent(module: torch.nn.Module) None#
Convert persistent buffers that are NOT saved in HF checkpoints to non-persistent buffers.
The RADIO vision encoder registers some buffers (e.g. summary_idxs) as persistent, but the HF checkpoint does not contain them. When the DCP loader builds its load plan, it expects every persistent buffer to appear in the checkpoint and raises RuntimeError: Missing key otherwise.
This method re-registers such buffers as non-persistent so they are kept at their init-time values and not expected on disk.
- get_input_embeddings()#
Return the input embeddings from the language model.
- set_input_embeddings(value)#
Set the input embeddings of the language model.
- get_output_embeddings()#
Return the output embeddings (lm_head) from the language model.
- set_output_embeddings(new_embeddings)#
Set the output embeddings (lm_head) of the language model.
- pixel_shuffle(
- x: torch.Tensor,
- scale_factor: float = 0.5,
Pixel shuffle for downsampling spatial resolution while increasing channels.
- Parameters:
x – Input tensor [N, W, H, C]
scale_factor – Downsampling ratio (default 0.5 = halve spatial dims)
- Returns:
Shuffled tensor [N, W*scale, H*scale, C/(scale^2)]
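The return shape follows directly from the scale factor: each spatial dimension is multiplied by the scale, and the channel dimension is divided by its square, so the total number of elements is unchanged. A small sketch of that shape arithmetic follows; `pixel_shuffle_shape` is a hypothetical helper that computes shapes only and does not move any data.

```python
def pixel_shuffle_shape(shape, scale_factor=0.5):
    """Output shape of pixel shuffle for an input of shape (N, W, H, C)."""
    n, w, h, c = shape
    return (
        n,
        int(w * scale_factor),
        int(h * scale_factor),
        int(c / (scale_factor ** 2)),
    )


in_shape = (4, 32, 32, 1280)
out_shape = pixel_shuffle_shape(in_shape)
print(out_shape)  # (4, 16, 16, 5120)
```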
- extract_feature(pixel_values: torch.Tensor) torch.Tensor#
Extract vision features from pixel values through RADIO + projector.
- Parameters:
pixel_values – Image tensors [num_tiles, C, H, W]
- Returns:
Vision embeddings [num_tiles, num_tokens, llm_hidden_size]
- extract_feature_dynamic(
- pixel_values: torch.Tensor,
- imgs_sizes: torch.Tensor | list[tuple[int, int]],
Dynamic-resolution feature extraction (no tile splitting).
Matches vLLM’s dynamic-resolution vision path for Nano v3 VL / Nemotron-Omni (see 3rdparty/vllm/vllm/model_executor/models/nano_nemotron_vl.py). Required when the rollout uses DynamicResolutionImageTiler: tile-based extract_feature would produce different embeddings and break rollout/train logprob agreement.
Unlike vLLM’s RADIO port (which supports packed imgs_sizes= inputs), the HF RADIO from nvidia/C-RADIOv2-H only accepts a dense (B, C, H, W) tensor. We crop each padded image back to its real size and run the vision model per-image, then concatenate features.
- Parameters:
pixel_values – [num_images, C, H_padded, W_padded] batch of dynamically-resized images padded to the batch max (h, w).
imgs_sizes – [num_images, 2] actual (h, w) per image (torch tensor of ints) or an equivalent list of tuples.
- Returns:
Vision embeddings [sum_num_embeddings_after_pixel_shuffle, llm_hidden_size].
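The first dimension of the result depends on each image's real size. The sketch below estimates it under two assumptions that the source does not confirm: each image yields (h / patch_size) x (w / patch_size) patch features, and pixel shuffle scales both grid dimensions by downsample_ratio. The helper name and the sample sizes are hypothetical.

```python
def embeddings_per_image(imgs_sizes, patch_size=16, downsample_ratio=0.5):
    """Per-image embedding counts after pixel shuffle (assumed formula)."""
    counts = []
    for h, w in imgs_sizes:
        h_patches, w_patches = h // patch_size, w // patch_size
        counts.append(
            int(h_patches * downsample_ratio) * int(w_patches * downsample_ratio)
        )
    return counts


# Two images of different real sizes, given as (h, w).
counts = embeddings_per_image([(512, 512), (256, 384)])
print(counts, sum(counts))  # [256, 96] 352
```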
- _pixel_shuffle_dynamic_res(
- x: torch.Tensor,
- imgs_sizes: list[tuple[int, int]],
Per-image pixel-shuffle for dynamic-resolution outputs.
Ported from vLLM’s NanoNemotronVLMultimodal.pixel_shuffle_dynamic_res. Splits x along the sequence dim by per-image patch counts, reshapes each split to (N, H_patches, W_patches, C_feat), applies pixel_shuffle with downsample_ratio, and flattens back to a concatenated (N, L’, C).
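The splitting step can be sketched on a plain list. This assumes each image contributes (h / patch_size) * (w / patch_size) consecutive entries along the sequence dimension; the helper `split_by_patch_counts` is hypothetical and omits the reshape and pixel-shuffle steps that follow in the real method.

```python
def split_by_patch_counts(tokens, imgs_sizes, patch_size=16):
    """Split a flat token sequence into per-image chunks by patch count."""
    splits, start = [], 0
    for h, w in imgs_sizes:
        count = (h // patch_size) * (w // patch_size)
        splits.append(tokens[start:start + count])
        start += count
    if start != len(tokens):
        raise ValueError("sequence length does not match the per-image patch counts")
    return splits


# A 32x32 image has 2x2 = 4 patches; a 16x32 image has 1x2 = 2 patches.
splits = split_by_patch_counts(list(range(6)), [(32, 32), (16, 32)])
print(splits)  # [[0, 1, 2, 3], [4, 5]]
```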
- extract_video_feature(
- pixel_values_videos: torch.Tensor,
Pack T = video_temporal_patch_dim frames into channels and run the ViT.
Returns embeddings shaped like extract_feature output, but with ceil(N_frames / T) rows instead of one row per frame.
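The row count can be illustrated by grouping frame indices into chunks of T. This is only a sketch of the counting: the source does not say how a short final group is filled (for example by padding or repeating frames), so the hypothetical helper below leaves it short.

```python
import math


def group_frames(n_frames, t):
    """Group frame indices into consecutive chunks of at most t frames."""
    return [list(range(s, min(s + t, n_frames))) for s in range(0, n_frames, t)]


groups = group_frames(n_frames=5, t=2)
print(groups)       # [[0, 1], [2, 3], [4]]
print(len(groups))  # 3, which equals ceil(5 / 2)
```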
- extract_sound_feature(
- input_features: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
Extract and project sound features from audio input.
- Parameters:
input_features – Mel spectrogram features [batch, seq_len, feature_dim]
attention_mask – Optional attention mask [batch, seq_len]
- Returns:
Sound embeddings projected to LLM hidden size
- prepare_model_inputs_for_cp(
- input_ids: torch.Tensor,
- pixel_values: Optional[torch.Tensor] = None,
- image_flags: Optional[torch.Tensor] = None,
- imgs_sizes: Optional[torch.Tensor] = None,
- pixel_values_videos: Optional[torch.Tensor] = None,
- sound_features: Optional[torch.Tensor] = None,
- sound_attention_mask: Optional[torch.Tensor] = None,
Merge image/video/audio features into text embeddings BEFORE CP sharding.
Under CP > 1 the sequence is sharded; multimodal scatter must run on the full un-sharded sequence so each rank ends up with embeddings that match its local slice of input_ids. Returns a dict so future per-layer inputs can ride alongside inputs_embeds.
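A toy example shows why the merge has to come first. In this sketch, strings stand in for embedding vectors, and the sequence is sharded into contiguous equal chunks, which is a simplifying assumption (real context-parallel sharding may use a different layout). `merge_then_shard` is a hypothetical helper; the token ID 18 is the default img_context_token_id from NemotronOmniConfig.

```python
IMG_CONTEXT_TOKEN_ID = 18  # default img_context_token_id in NemotronOmniConfig


def merge_then_shard(input_ids, text_embeds, vision_embeds, cp_size):
    """Scatter vision embeddings over the full sequence, then shard it."""
    vision_iter = iter(vision_embeds)
    merged = [
        next(vision_iter) if tok == IMG_CONTEXT_TOKEN_ID else emb
        for tok, emb in zip(input_ids, text_embeds)
    ]
    # Assumption: contiguous equal chunks, sequence length divisible by cp_size.
    chunk = len(merged) // cp_size
    return [merged[r * chunk:(r + 1) * chunk] for r in range(cp_size)]


input_ids = [1, 18, 18, 5, 18, 18, 7, 2]
text_embeds = [f"t{i}" for i in range(len(input_ids))]
vision_embeds = ["v0", "v1", "v2", "v3"]

shards = merge_then_shard(input_ids, text_embeds, vision_embeds, cp_size=2)
print(shards)  # [['t0', 'v0', 'v1', 't3'], ['v2', 'v3', 't6', 't7']]
```

Rank 1 receives v2 and v3 only because the scatter walked the whole sequence before sharding. A rank that saw just its local slice of input_ids would not know that two vision embeddings had already been consumed on rank 0.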
- prepare_inputs_embeds_for_cp(
- input_ids: torch.Tensor,
- pixel_values: Optional[torch.Tensor] = None,
- image_flags: Optional[torch.Tensor] = None,
- imgs_sizes: Optional[torch.Tensor] = None,
- pixel_values_videos: Optional[torch.Tensor] = None,
- sound_features: Optional[torch.Tensor] = None,
- sound_attention_mask: Optional[torch.Tensor] = None,
Thin wrapper returning just inputs_embeds for callers that don’t need the full prepared-inputs dict.
- forward(
- pixel_values: Optional[torch.FloatTensor] = None,
- input_ids: Optional[torch.LongTensor] = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- image_flags: Optional[torch.LongTensor] = None,
- imgs_sizes: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None,
- labels: Optional[torch.LongTensor] = None,
- sound_features: Optional[torch.FloatTensor] = None,
- sound_attention_mask: Optional[torch.Tensor] = None,
- pixel_values_videos: Optional[torch.FloatTensor] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- use_cache: Optional[bool] = None,
- output_attentions: Optional[bool] = None,
- output_hidden_states: Optional[bool] = None,
- return_dict: Optional[bool] = None,
- *,
- _pre_embed_only: bool = False,
- **kwargs,
Forward pass for training.
This follows the same pattern as the HF NemotronH_Nano_Omni_Reasoning_V3.forward():
1. Get text embeddings from LLM embed_tokens
2. Extract vision features from pixel_values
3. Replace image token embeddings with vision embeddings
4. Run LLM forward pass
5. Compute loss if labels provided
- Parameters:
pixel_values – Image pixel values [num_tiles, C, H, W]
input_ids – Input token IDs [batch, seq_len]
attention_mask – Attention mask [batch, seq_len]
position_ids – Position IDs (unused, for API compat)
image_flags – Flags indicating real images vs padding [num_tiles, 1]
labels – Token IDs for loss computation [batch, seq_len]
inputs_embeds – Pre-computed input embeddings (optional)
use_cache – Whether to use caching (not used in training)
**kwargs – Additional arguments
- Returns:
CausalLMOutputWithPast with loss and logits
- initialize_weights(
- buffer_device: torch.device | None = None,
- dtype: torch.dtype = torch.bfloat16,
Initialize model weights.
- Parameters:
buffer_device – Device to use for buffer initialization
dtype – Target dtype for model weights
- nemo_automodel.components.models.nemotron_omni.model.ModelClass#
None