nemo_automodel.components.models.nemotron_omni.model
NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) custom model for NeMo Automodel.
This model is a VLM (vision-language model) with:
- Vision encoder: RADIO v2.5-H (ViT-Huge, patch_size=16) — loaded from HF
- Audio encoder: Parakeet (FastConformer-based) — loaded from HF
- LLM: NemotronH (hybrid Mamba+Attention MoE) — reuses nemotron_v3 custom implementation
- Projectors: MLP projectors for vision->LLM and audio->LLM
Architecture name: “NemotronH_Nano_Omni_Reasoning_V3” (from config.json)
Module Contents
Classes
Data
API
Bases: PretrainedConfig
Configuration for the NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) model.
This wraps the HF config and provides easy access to sub-configs.
Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin
NemotronOmni VLM model for conditional generation (training).
Wraps:
- Vision encoder (RADIO v2.5-H) — HF implementation via trust_remote_code
- Audio encoder (Parakeet) — HF implementation via trust_remote_code
- Vision projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)
- Sound projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)
- Language model (NemotronH hybrid Mamba+Attention MoE) — nemotron_v3 custom impl
The LLM part reuses the nemotron_v3 implementation (NemotronHForCausalLM) which has custom DTensor parallelism for the Mamba+Attention hybrid MoE architecture.
Convert persistent buffers that are NOT saved in HF checkpoints to non-persistent buffers.
The RADIO vision encoder registers some buffers (e.g. summary_idxs)
as persistent, but the HF checkpoint does not contain them. When the DCP
loader builds its load plan it expects every persistent buffer to appear
in the checkpoint and raises RuntimeError: Missing key otherwise.
This method re-registers such buffers as non-persistent so they are kept at their init-time values and not expected on disk.
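The re-registration step can be sketched as follows. This is a minimal illustration, not the module's actual method: `TinyEncoder`, `demote_missing_buffers`, and the `checkpoint_keys` argument are hypothetical names. It relies on `torch.nn.Module.register_buffer(..., persistent=False)`, which keeps the tensor on the module but excludes it from `state_dict()`.

```python
import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for an encoder with a buffer the checkpoint lacks."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        # Persistent by default, so it shows up in state_dict().
        self.register_buffer("summary_idxs", torch.tensor([0, 1, 2]))


def demote_missing_buffers(model: nn.Module, checkpoint_keys: set) -> list:
    """Re-register persistent buffers absent from checkpoint_keys as non-persistent."""
    demoted = []
    for prefix, sub in model.named_modules():
        for name, buf in list(sub.named_buffers(recurse=False)):
            full_name = f"{prefix}.{name}" if prefix else name
            # _non_persistent_buffers_set is a private nn.Module attribute.
            is_persistent = name not in sub._non_persistent_buffers_set
            if is_persistent and full_name not in checkpoint_keys:
                # Same tensor, same name; only the persistence flag changes.
                sub.register_buffer(name, buf, persistent=False)
                demoted.append(full_name)
    return demoted


encoder = TinyEncoder()
# Assumed checkpoint contents: weights only, no summary_idxs entry.
ckpt_keys = {"proj.weight", "proj.bias"}
demoted = demote_missing_buffers(encoder, ckpt_keys)
```

After the call, the buffer is still reachable as `encoder.summary_idxs` with its init-time value, but a loader that plans from `state_dict()` no longer looks for it on disk.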
Per-image pixel-shuffle for dynamic-resolution outputs.
Ported from vLLM’s NanoNemotronVLMultimodal.pixel_shuffle_dynamic_res.
Splits x along the sequence dim by per-image patch counts, reshapes
each split to (N, H_patches, W_patches, C_feat), applies pixel_shuffle
with downsample_ratio, and flattens back to a concatenated (N, L’, C).
Extract vision features from pixel values through RADIO + projector.
Parameters:
Image tensors [num_tiles, C, H, W]
Returns: torch.Tensor
Vision embeddings [num_tiles, num_tokens, llm_hidden_size]
Dynamic-resolution feature extraction (no tile splitting).
Matches vLLM’s dynamic-resolution vision path for Nano v3 VL /
Nemotron-Omni (see 3rdparty/vllm/vllm/model_executor/models/nano_nemotron_vl.py).
Required when the rollout uses
DynamicResolutionImageTiler — tile-based extract_feature would
produce different embeddings and break rollout/train logprob
agreement.
Unlike vLLM’s RADIO port (which supports packed imgs_sizes= inputs),
the HF RADIO from nvidia/C-RADIOv2-H only accepts a dense
(B, C, H, W) tensor. We crop each padded image back to its real
size and run the vision model per-image, then concatenate features.
Parameters:
[num_images, C, H_padded, W_padded] batch of dynamically-resized images padded to the batch max (h, w).
[num_images, 2] actual (h, w) per image (torch tensor of ints) or an equivalent list of tuples.
Returns: torch.Tensor
Vision embeddings [sum_num_embeddings_after_pixel_shuffle,
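The crop-then-encode loop described above might look roughly like this. It is a sketch under stated assumptions: `toy_encoder` is a hypothetical stand-in for the HF RADIO model (one token per 16x16 patch via average pooling), `encode_per_image` is an illustrative name, padding is assumed to sit at the bottom/right of each image, and the pixel-shuffle and projector stages are omitted.

```python
import torch
import torch.nn.functional as F


def toy_encoder(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the vision model: one token per 16x16 patch."""
    pooled = F.avg_pool2d(x, kernel_size=16)       # (1, C, h/16, w/16)
    return pooled.flatten(2).transpose(1, 2)       # (1, num_patches, C)


def encode_per_image(pixel_values, image_sizes, encoder):
    """Crop each padded image to its real size, encode it, concatenate the tokens."""
    feats = []
    for img, (h, w) in zip(pixel_values, image_sizes):
        h, w = int(h), int(w)
        # Assumption: the real image is the top-left crop of the padded tensor.
        cropped = img[:, :h, :w].unsqueeze(0)      # dense (1, C, h, w) input
        feats.append(encoder(cropped).squeeze(0))  # (num_patches_i, C)
    return torch.cat(feats, dim=0)


batch = torch.zeros(2, 3, 32, 48)                  # padded to the batch max (32, 48)
sizes = torch.tensor([[32, 32], [16, 48]])         # real (h, w) per image
tokens = encode_per_image(batch, sizes, toy_encoder)
```

Here the first image contributes 4 tokens and the second 3, so the concatenated result has 7 rows. Running per image trades throughput for compatibility with an encoder that only accepts a dense `(B, C, H, W)` input.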
Extract and project sound features from audio input.
Parameters:
Mel spectrogram features [batch, seq_len, feature_dim]
Optional attention mask [batch, seq_len]
Returns: torch.Tensor
Sound embeddings projected to LLM hidden size
Pack T = video_temporal_patch_dim frames into channels and run the ViT.
Returns embeddings shaped like extract_feature output, but with
ceil(N_frames / T) rows instead of one row per frame.
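The row count `ceil(N_frames / T)` follows from grouping frames before the ViT sees them. A minimal sketch of such packing is shown below; `pack_frames` is a hypothetical helper, and repeating the last frame to fill an incomplete final group is an assumption made for illustration (the actual padding strategy is not stated here).

```python
import math

import torch


def pack_frames(frames: torch.Tensor, t: int) -> torch.Tensor:
    """Stack every t consecutive frames along the channel dimension."""
    n, c, h, w = frames.shape
    groups = math.ceil(n / t)
    pad = groups * t - n
    if pad:
        # Assumption: fill the last group by repeating the final frame.
        frames = torch.cat([frames, frames[-1:].expand(pad, -1, -1, -1)], dim=0)
    # (groups * t, C, H, W) -> (groups, t * C, H, W)
    return frames.reshape(groups, t * c, h, w)


frames = torch.arange(5 * 3 * 2 * 2, dtype=torch.float32).reshape(5, 3, 2, 2)
packed = pack_frames(frames, t=2)   # 5 frames, T = 2 -> ceil(5 / 2) = 3 rows
```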
Forward pass for training.
This follows the same pattern as the HF NemotronH_Nano_Omni_Reasoning_V3.forward():
- Get text embeddings from LLM embed_tokens
- Extract vision features from pixel_values
- Replace image token embeddings with vision embeddings
- Run LLM forward pass
- Compute loss if labels provided
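The embedding-replacement step in the list above can be illustrated with a boolean-mask scatter. This is a sketch, not the model's code: `merge_vision_embeddings` is a hypothetical helper and `IMAGE_TOKEN_ID = 9` is a made-up placeholder id.

```python
import torch


def merge_vision_embeddings(input_ids, inputs_embeds, vision_embeds, image_token_id):
    """Overwrite embeddings at image-placeholder positions with vision features."""
    mask = input_ids == image_token_id                          # [batch, seq_len]
    flat = vision_embeds.reshape(-1, vision_embeds.shape[-1])   # [num_image_tokens, hidden]
    if int(mask.sum()) != flat.shape[0]:
        raise ValueError("image placeholder count does not match vision embeddings")
    merged = inputs_embeds.clone()   # leave the text embeddings untouched
    merged[mask] = flat.to(merged.dtype)
    return merged


IMAGE_TOKEN_ID = 9                        # hypothetical placeholder id
input_ids = torch.tensor([[1, 9, 9, 2]])
text_embeds = torch.zeros(1, 4, 3)
vision_embeds = torch.ones(1, 2, 3)       # [num_tiles, num_tokens, hidden]
merged = merge_vision_embeddings(input_ids, text_embeds, vision_embeds, IMAGE_TOKEN_ID)
```

The count check matters in practice: a mismatch between placeholder tokens and vision rows usually indicates a preprocessing bug rather than something to silently truncate.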
Parameters:
Image pixel values [num_tiles, C, H, W]
Input token IDs [batch, seq_len]
Attention mask [batch, seq_len]
Position IDs (unused, for API compat)
Flags indicating real images vs padding [num_tiles, 1]
Token IDs for loss computation [batch, seq_len]
Pre-computed input embeddings (optional)
Whether to use caching (not used in training)
Whether the returned output should carry the
final decoder hidden states (required for fused linear
cross-entropy / cut-CE). Defaults to the text sub-config’s
output_hidden_states when None.
If 0 (default), compute logits for all positions;
if > 0, only compute logits for the last logits_to_keep
positions (used by fused linear cross-entropy to avoid the full
logit matrix). Forwarded to the language-model lm_head gating.
Additional arguments
Returns: Union[dict, Tuple, CausalLMOutputWithPast]
CausalLMOutputWithPast with loss and logits
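The `logits_to_keep` behaviour described in the parameters above can be sketched with a single slice. `select_logits` is a hypothetical helper and the toy `lm_head` is an assumption; the point is only that `0` keeps every position while `k > 0` keeps the last `k`.

```python
import torch
from torch import nn


def select_logits(hidden_states, lm_head, logits_to_keep=0):
    """Project only the last logits_to_keep positions (all positions when 0)."""
    # slice(-0, None) is slice(0, None), so the default keeps the full sequence.
    keep = slice(-logits_to_keep, None)
    return lm_head(hidden_states[:, keep, :])


torch.manual_seed(0)
hidden = torch.randn(2, 5, 4)               # [batch, seq_len, hidden]
lm_head = nn.Linear(4, 10, bias=False)      # toy vocabulary of 10
all_logits = select_logits(hidden, lm_head)                      # [2, 5, 10]
last_logits = select_logits(hidden, lm_head, logits_to_keep=1)   # [2, 1, 10]
```

Slicing the hidden states before the projection is what avoids materialising the full `[batch, seq_len, vocab]` logit matrix.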
Create model from config.
Parameters:
NemotronH_Nano_Omni_Reasoning_V3 config (HF config with trust_remote_code)
Backend configuration
Additional arguments
Returns:
NemotronOmniForConditionalGeneration instance
Load pretrained model.
Parameters:
Path or name of pretrained model
Additional positional arguments
Additional keyword arguments
Returns:
NemotronOmniForConditionalGeneration instance
Return the input embeddings from the language model.
Return the output embeddings (lm_head) from the language model.
Initialize model weights.
Parameters:
Device to use for buffer initialization
Target dtype for model weights
Pixel shuffle for downsampling spatial resolution while increasing channels.
Parameters:
Input tensor [N, W, H, C]
Downsampling ratio (default 0.5 = halve spatial dims)
Returns: torch.Tensor
Shuffled tensor [N, W*scale, H*scale, C/(scale^2)]
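One common way to write this operation, following the formulation used in InternVL-style VLMs, is sketched below. It is an illustration of the shape contract only; the module's own implementation may order the axes differently.

```python
import torch


def pixel_shuffle(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    """Trade spatial resolution for channels: [N, W, H, C] -> [N, W*s, H*s, C/s^2]."""
    n, w, h, c = x.shape
    # Fold neighbouring H positions into channels: [N, W, H*s, C/s]
    x = x.reshape(n, w, int(h * scale_factor), int(c / scale_factor))
    x = x.permute(0, 2, 1, 3).contiguous()          # [N, H*s, W, C/s]
    # Fold neighbouring W positions into channels: [N, H*s, W*s, C/s^2]
    x = x.reshape(n, int(h * scale_factor), int(w * scale_factor),
                  int(c / (scale_factor * scale_factor)))
    return x.permute(0, 2, 1, 3).contiguous()       # [N, W*s, H*s, C/s^2]


x = torch.arange(1 * 4 * 4 * 8, dtype=torch.float32).reshape(1, 4, 4, 8)
y = pixel_shuffle(x)   # 4x4 grid with 8 channels -> 2x2 grid with 32 channels
```

With the default ratio of 0.5, each output position concatenates the channels of a 2x2 block of input positions, so the token count drops by 4 and the channel width grows by 4.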
Thin wrapper returning just inputs_embeds for callers that don’t
need the full prepared-inputs dict.
Merge image/video/audio features into text embeddings BEFORE CP sharding.
Under CP > 1 the sequence is sharded; multimodal scatter must run on the
full un-sharded sequence so each rank ends up with embeddings that match
its local slice of input_ids. Returns a dict so future per-layer inputs
can ride alongside inputs_embeds.
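A toy example shows why the scatter has to run before sharding. Everything here is an assumption for illustration: `prepare_then_shard` is a hypothetical helper, `9` is a made-up image-token id, and the sequence is split into contiguous equal chunks, whereas a real context-parallel layout may be load-balanced differently.

```python
import torch


def prepare_then_shard(input_ids, text_embeds, vision_embeds, image_token_id, cp_size):
    """Merge multimodal features on the full sequence, then shard along seq_len."""
    mask = input_ids == image_token_id
    merged = text_embeds.clone()
    merged[mask] = vision_embeds                 # scatter on the un-sharded sequence
    prepared = {"inputs_embeds": merged}         # dict leaves room for more inputs
    # Assumption: rank r owns the r-th contiguous chunk of the sequence.
    return [
        {k: v.chunk(cp_size, dim=1)[r] for k, v in prepared.items()}
        for r in range(cp_size)
    ]


input_ids = torch.tensor([[1, 9, 9, 9, 9, 2, 3, 4]])            # image span crosses the split
text_embeds = torch.zeros(1, 8, 2)
vision_embeds = torch.arange(1.0, 5.0).unsqueeze(1).repeat(1, 2)  # rows valued 1, 2, 3, 4
shards = prepare_then_shard(input_ids, text_embeds, vision_embeds, 9, cp_size=2)
```

Rank 1 sees only one image token locally, yet it needs the fourth vision row. That offset is only known from the global sequence, which is why the merge runs first.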
Set the input embeddings of the language model.
Set the output embeddings (lm_head) of the language model.
Bases: Module
Root Mean Square Layer Normalization.
Bases: Module
MLP projector from sound encoder to LLM hidden space.
Bases: Module
Squared ReLU activation: ReLU(x)^2.
Bases: Module
MLP projector from vision encoder to LLM hidden space.
HF checkpoint structure (mlp1):
- mlp1.0.weight -> RMSNorm weight (vit_hidden_size * pixel_shuffle_factor^2,)
- mlp1.1.weight -> Linear1 weight (projector_hidden_size, vit_hidden_size * pixel_shuffle_factor^2)
- mlp1.3.weight -> Linear2 weight (llm_hidden_size, projector_hidden_size)
Between linear1 and linear2 there is a SquaredReLU activation (index 2 in Sequential, but it has no weight).
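The index layout can be reproduced with a plain `nn.Sequential`. This is a sketch rather than the module's class: the `RMSNorm` epsilon, the bias-free linear layers (inferred from the checkpoint listing only weights), the helper name `build_projector`, and the toy sizes are all assumptions.

```python
import torch
from torch import nn


class RMSNorm(nn.Module):
    """Root mean square normalization with a learned per-channel scale."""

    def __init__(self, dim: int, eps: float = 1e-5):   # eps value is an assumption
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        variance = x.float().pow(2).mean(-1, keepdim=True)
        return (x.float() * torch.rsqrt(variance + self.eps)).to(x.dtype) * self.weight


class SquaredReLU(nn.Module):
    """ReLU(x)^2; holds no parameters."""

    def forward(self, x):
        return torch.relu(x).pow(2)


def build_projector(vit_hidden, shuffle_factor, projector_hidden, llm_hidden):
    in_dim = vit_hidden * shuffle_factor ** 2
    return nn.Sequential(
        RMSNorm(in_dim),                                        # index 0
        nn.Linear(in_dim, projector_hidden, bias=False),        # index 1
        SquaredReLU(),                                          # index 2, no weight
        nn.Linear(projector_hidden, llm_hidden, bias=False),    # index 3
    )


mlp1 = build_projector(vit_hidden=8, shuffle_factor=2, projector_hidden=16, llm_hidden=12)
shapes = {k: tuple(v.shape) for k, v in mlp1.state_dict().items()}
out = mlp1(torch.randn(5, 32))
```

Because the activation owns no parameters, the state dict skips index 2 and contains only `0.weight`, `1.weight`, and `3.weight`.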
Thin proxy so the MoE parallelizer can navigate model.model.moe_config and model.model -> get_text_module -> .layers without changing the weight hierarchy.
The parallelizer (parallelizer.py) expects:
- model.model.moe_config (for expert-count validation)
- model.model -> get_text_module() (finds language_model attr) -> .layers
By setting self.model = _ModelProxy(self.language_model) on the VLM:
- model.model.moe_config -> language_model.model.moe_config (OK)
- get_text_module(model.model) -> model.model.language_model == language_model.model (NemotronV3Model) -> .layers (OK)
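The attribute navigation can be mimicked with plain Python objects. This is only a sketch of the idea: `ModelProxy` and `get_text_module` below are hypothetical stand-ins written for this example (the real `_ModelProxy` and parallelizer helper may be implemented differently), and the `n_routed_experts` key is made up.

```python
from types import SimpleNamespace


class ModelProxy:
    """Illustrative proxy: a plain object, so it adds no parameters or state-dict keys."""

    def __init__(self, language_model):
        # language_model is the causal-LM wrapper; its .model is the decoder stack.
        self.language_model = language_model.model

    @property
    def moe_config(self):
        return self.language_model.moe_config


def get_text_module(module):
    """Hypothetical stand-in for the parallelizer lookup: follow language_model if present."""
    return getattr(module, "language_model", module)


# Toy objects standing in for the decoder, the causal-LM wrapper, and the VLM.
decoder = SimpleNamespace(moe_config={"n_routed_experts": 8}, layers=["layer0", "layer1"])
causal_lm = SimpleNamespace(model=decoder)
vlm = SimpleNamespace(language_model=causal_lm)
vlm.model = ModelProxy(vlm.language_model)

text_module = get_text_module(vlm.model)
```

Both access paths resolve to the same decoder object, so the parallelizer can find `moe_config` and `.layers` while the weights stay where the checkpoint expects them.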