# nemo_automodel.components.models.nemotron_omni.model

NemotronOmni (NemotronH\_Nano\_Omni\_Reasoning\_V3) custom model for Nemo Automodel.

This model is a VLM (vision-language model) with:

* Vision encoder: RADIO v2.5-H (ViT-Huge, patch\_size=16) -- loaded from HF
* Audio encoder: Parakeet (FastConformer-based) -- loaded from HF
* LLM: NemotronH (hybrid Mamba+Attention MoE) -- reuses nemotron\_v3 custom implementation
* Projectors: MLP projectors for vision->LLM and audio->LLM

Architecture name: "NemotronH\_Nano\_Omni\_Reasoning\_V3" (from config.json)

## Module Contents

### Classes

| Name                                                                                                                                 | Description                                                                      |
| ------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------- |
| [`NemotronOmniConfig`](#nemo_automodel-components-models-nemotron_omni-model-NemotronOmniConfig)                                     | Configuration for the NemotronOmni (NemotronH\_Nano\_Omni\_Reasoning\_V3) model. |
| [`NemotronOmniForConditionalGeneration`](#nemo_automodel-components-models-nemotron_omni-model-NemotronOmniForConditionalGeneration) | NemotronOmni VLM model for conditional generation (training).                    |
| [`RMSNorm`](#nemo_automodel-components-models-nemotron_omni-model-RMSNorm)                                                           | Root Mean Square Layer Normalization.                                            |
| [`SoundProjection`](#nemo_automodel-components-models-nemotron_omni-model-SoundProjection)                                           | MLP projector from sound encoder to LLM hidden space.                            |
| [`SquaredReLU`](#nemo_automodel-components-models-nemotron_omni-model-SquaredReLU)                                                   | Squared ReLU activation: ReLU(x)^2.                                              |
| [`VisionProjector`](#nemo_automodel-components-models-nemotron_omni-model-VisionProjector)                                           | MLP projector from vision encoder to LLM hidden space.                           |
| [`_ModelProxy`](#nemo_automodel-components-models-nemotron_omni-model-_ModelProxy)                                                   | Thin proxy so the MoE parallelizer can navigate model.model.moe\_config          |

### Data

[`ModelClass`](#nemo_automodel-components-models-nemotron_omni-model-ModelClass)

[`logger`](#nemo_automodel-components-models-nemotron_omni-model-logger)

### API

```python
class nemo_automodel.components.models.nemotron_omni.model.NemotronOmniConfig(
    vision_config = None,
    llm_config = None,
    sound_config = None,
    force_image_size = 512,
    downsample_ratio = 0.5,
    patch_size = 16,
    template = None,
    ps_version = 'v2',
    image_tag_type = 'internvl',
    projector_hidden_size = 20480,
    vit_hidden_size = 1280,
    img_context_token_id = 18,
    video_context_token_id = 131081,
    sound_context_token_id = 27,
    video_pruning_rate = 0.7,
    **kwargs
)
```

**Bases:** `PretrainedConfig`

Configuration for the NemotronOmni (NemotronH\_Nano\_Omni\_Reasoning\_V3) model.

This wraps the HF config and provides easy access to sub-configs.
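
The defaults above imply a fixed number of vision tokens per tile. The following is a minimal sketch of that arithmetic, assuming the InternVL-style rule (suggested by `image_tag_type = 'internvl'`, but not confirmed on this page) that a tile produces `(force_image_size / patch_size)^2` patches and pixel shuffle scales that count by `downsample_ratio^2`. The function name `tokens_per_tile` is hypothetical.

```python
def tokens_per_tile(force_image_size: int = 512,
                    patch_size: int = 16,
                    downsample_ratio: float = 0.5) -> int:
    """Vision tokens per tile after pixel shuffle (assumed InternVL-style rule)."""
    patches_per_side = force_image_size // patch_size   # 512 // 16 = 32
    num_patches = patches_per_side ** 2                 # 32 * 32 = 1024
    return int(num_patches * downsample_ratio ** 2)     # 1024 * 0.25 = 256


print(tokens_per_tile())  # 256
```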

```python
class nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

**Bases:** [HFCheckpointingMixin](/nemo-automodel/nemo_automodel/components/models/common/hf_checkpointing_mixin#nemo_automodel-components-models-common-hf_checkpointing_mixin-HFCheckpointingMixin), `Module`, [MoEFSDPSyncMixin](/nemo-automodel/nemo_automodel/components/moe/fsdp_mixin#nemo_automodel-components-moe-fsdp_mixin-MoEFSDPSyncMixin)

NemotronOmni VLM model for conditional generation (training).

Wraps:

* Vision encoder (RADIO v2.5-H) -- HF implementation via trust\_remote\_code
* Audio encoder (Parakeet) -- HF implementation via trust\_remote\_code
* Vision projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)
* Sound projector (MLP: RMSNorm -> Linear -> SquaredReLU -> Linear)
* Language model (NemotronH hybrid Mamba+Attention MoE) -- nemotron\_v3 custom impl

The LLM part reuses the nemotron\_v3 implementation (NemotronHForCausalLM) which
has custom DTensor parallelism for the Mamba+Attention hybrid MoE architecture.

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration._make_missing_buffers_non_persistent(
    module: torch.nn.Module
) -> None
```

staticmethod

Convert persistent buffers that are NOT saved in HF checkpoints
to non-persistent buffers.

The RADIO vision encoder registers some buffers (e.g. `summary_idxs`)
as persistent, but the HF checkpoint does not contain them.  When the DCP
loader builds its load plan it expects every persistent buffer to appear
in the checkpoint and raises `RuntimeError: Missing key` otherwise.

This method re-registers such buffers as non-persistent so they are
kept at their init-time values and not expected on disk.

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration._pixel_shuffle_dynamic_res(
    x: torch.Tensor,
    imgs_sizes: list[tuple[int, int]]
) -> torch.Tensor
```

Per-image pixel-shuffle for dynamic-resolution outputs.

Ported from vLLM's `NanoNemotronVLMultimodal.pixel_shuffle_dynamic_res`.
Splits `x` along the sequence dim by per-image patch counts, reshapes
each split to (N, H\_patches, W\_patches, C\_feat), applies pixel\_shuffle
with `downsample_ratio`, and flattens back to a concatenated (N, L', C).
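
The split step can be illustrated with plain shape arithmetic. This is a sketch only: it assumes `imgs_sizes` holds pixel `(height, width)` pairs and that each side divides evenly by `patch_size / downsample_ratio`. The helper name `split_sizes_for_pixel_shuffle` is hypothetical and the real method operates on tensors.

```python
def split_sizes_for_pixel_shuffle(imgs_sizes, patch_size=16, downsample_ratio=0.5):
    """Return (patches_in, tokens_out) per image for a packed patch sequence."""
    result = []
    for height, width in imgs_sizes:
        h_patches = height // patch_size
        w_patches = width // patch_size
        patches_in = h_patches * w_patches
        # Pixel shuffle shrinks each spatial dim by downsample_ratio.
        tokens_out = int(h_patches * downsample_ratio) * int(w_patches * downsample_ratio)
        result.append((patches_in, tokens_out))
    return result


sizes = [(64, 96), (128, 32)]
print(split_sizes_for_pixel_shuffle(sizes))  # [(24, 6), (16, 4)]
```

The first element of each pair is the length of that image's slice of `x` along the sequence dimension; the second is how many rows it contributes to the concatenated output.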

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.extract_feature(
    pixel_values: torch.Tensor
) -> torch.Tensor
```

Extract vision features from pixel values through RADIO + projector.

**Parameters:**

* `pixel_values`: Image tensors \[num\_tiles, C, H, W]

**Returns:** `torch.Tensor`

Vision embeddings \[num\_tiles, num\_tokens, llm\_hidden\_size]

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.extract_feature_dynamic(
    pixel_values: torch.Tensor,
    imgs_sizes: torch.Tensor | list[tuple[int, int]]
) -> torch.Tensor
```

Dynamic-resolution feature extraction (no tile splitting).

Matches vLLM's dynamic-resolution vision path for Nano v3 VL /
Nemotron-Omni (see 3rdparty/vllm/vllm/model\_executor/models/nano\_nemotron\_vl.py).
Required when the rollout uses DynamicResolutionImageTiler: the tile-based
`extract_feature` would produce different embeddings and break
rollout/train logprob agreement.

Unlike vLLM's RADIO port (which supports packed `imgs_sizes=` inputs),
the HF RADIO from nvidia/C-RADIOv2-H only accepts a dense
`(B, C, H, W)` tensor. We crop each padded image back to its real
size and run the vision model per-image, then concatenate features.

**Parameters:**

* `pixel_values`: \[num\_images, C, H\_padded, W\_padded] batch of dynamically resized images padded to the batch max (h, w).
* `imgs_sizes`: \[num\_images, 2] actual (h, w) per image (torch tensor of ints) or an equivalent list of tuples.

**Returns:** `torch.Tensor`

Vision embeddings \[sum\_num\_embeddings\_after\_pixel\_shuffle,
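
The crop-then-run-per-image step can be pictured with nested lists standing in for tensors. This sketch assumes padding sits at the bottom and right of each image and omits the channel dimension; neither detail is stated on this page, and `crop_to_real_size` is a hypothetical helper.

```python
def crop_to_real_size(padded_images, imgs_sizes):
    """Crop each padded 2-D image back to its true (h, w)."""
    cropped = []
    for image, (h, w) in zip(padded_images, imgs_sizes):
        # Keep the first h rows, and the first w columns of each kept row.
        cropped.append([row[:w] for row in image[:h]])
    return cropped


padded = [
    [[1, 2, 0], [3, 4, 0], [0, 0, 0]],   # real size 2x2
    [[5, 6, 7], [0, 0, 0], [0, 0, 0]],   # real size 1x3
]
print(crop_to_real_size(padded, [(2, 2), (1, 3)]))
# [[[1, 2], [3, 4]], [[5, 6, 7]]]
```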

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.extract_sound_feature(
    input_features: torch.Tensor,
    attention_mask: typing.Optional[torch.Tensor] = None
) -> torch.Tensor
```

Extract and project sound features from audio input.

**Parameters:**

* `input_features`: Mel spectrogram features \[batch, seq\_len, feature\_dim]
* `attention_mask`: Optional attention mask \[batch, seq\_len]

**Returns:** `torch.Tensor`

Sound embeddings projected to LLM hidden size

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.extract_video_feature(
    pixel_values_videos: torch.Tensor
) -> torch.Tensor
```

Pack `T = video_temporal_patch_dim` frames into channels and run the ViT.

Returns embeddings shaped like `extract_feature` output, but with
`ceil(N_frames / T)` rows instead of one row per frame.
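
The row count follows from grouping consecutive frames. A minimal sketch of that grouping, in which `group_frames` is a hypothetical helper; how a short final group is padded is not stated here, so the sketch simply leaves it short.

```python
import math


def group_frames(num_frames: int, temporal_patch_dim: int):
    """Return [start, end) frame index ranges, one per output row."""
    num_groups = math.ceil(num_frames / temporal_patch_dim)
    return [
        (g * temporal_patch_dim, min((g + 1) * temporal_patch_dim, num_frames))
        for g in range(num_groups)
    ]


groups = group_frames(num_frames=5, temporal_patch_dim=2)
print(groups)       # [(0, 2), (2, 4), (4, 5)]
print(len(groups))  # 3
```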

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.forward(
    pixel_values: typing.Optional[torch.FloatTensor] = None,
    input_ids: typing.Optional[torch.LongTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    image_flags: typing.Optional[torch.LongTensor] = None,
    imgs_sizes: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None,
    labels: typing.Optional[torch.LongTensor] = None,
    sound_features: typing.Optional[torch.FloatTensor] = None,
    sound_attention_mask: typing.Optional[torch.Tensor] = None,
    pixel_values_videos: typing.Optional[torch.FloatTensor] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    use_cache: typing.Optional[bool] = None,
    output_attentions: typing.Optional[bool] = None,
    output_hidden_states: typing.Optional[bool] = None,
    return_dict: typing.Optional[bool] = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    _pre_embed_only: bool = False,
    **kwargs
) -> typing.Union[dict, typing.Tuple, transformers.modeling_outputs.CausalLMOutputWithPast]
```

Forward pass for training.

This follows the same pattern as the HF NemotronH\_Nano\_Omni\_Reasoning\_V3.forward():

1. Get text embeddings from LLM embed\_tokens
2. Extract vision features from pixel\_values
3. Replace image token embeddings with vision embeddings
4. Run LLM forward pass
5. Compute loss if labels provided
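
Steps 1 to 3 amount to a positional scatter. The sketch below uses plain lists in place of tensors and the default `img_context_token_id` of 18 from `NemotronOmniConfig`; `merge_vision_embeddings` is a hypothetical name, and the real implementation may handle batching and `image_flags` differently.

```python
IMG_CONTEXT_TOKEN_ID = 18  # default img_context_token_id from NemotronOmniConfig


def merge_vision_embeddings(input_ids, text_embeds, vision_embeds,
                            image_token_id=IMG_CONTEXT_TOKEN_ID):
    """Replace embeddings at image-token positions with vision embeddings, in order."""
    positions = [i for i, tok in enumerate(input_ids) if tok == image_token_id]
    if len(positions) != len(vision_embeds):
        raise ValueError(
            f"{len(positions)} image tokens but {len(vision_embeds)} vision embeddings"
        )
    merged = list(text_embeds)  # copy so the caller's list is untouched
    for pos, emb in zip(positions, vision_embeds):
        merged[pos] = emb
    return merged


input_ids = [5, 18, 18, 9]
text_embeds = [[0.1], [0.0], [0.0], [0.4]]
vision_embeds = [[7.0], [8.0]]
print(merge_vision_embeddings(input_ids, text_embeds, vision_embeds))
# [[0.1], [7.0], [8.0], [0.4]]
```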

**Parameters:**

* `pixel_values`: Image pixel values \[num\_tiles, C, H, W]
* `input_ids`: Input token IDs \[batch, seq\_len]
* `attention_mask`: Attention mask \[batch, seq\_len]
* `position_ids`: Position IDs (unused; accepted for API compatibility)
* `image_flags`: Flags indicating real images vs. padding \[num\_tiles, 1]
* `labels`: Token IDs for loss computation \[batch, seq\_len]
* `inputs_embeds`: Pre-computed input embeddings (optional)
* `use_cache`: Whether to use caching (not used in training)
* `output_hidden_states`: Whether the returned output should carry the final decoder hidden states (required for fused linear cross-entropy / cut-CE). Defaults to the text sub-config's `output_hidden_states` when `None`.
* `logits_to_keep`: If 0 (default), compute logits for all positions; if > 0, compute logits only for the last `logits_to_keep` positions (used by fused linear cross-entropy to avoid materializing the full logit matrix). Forwarded to the language-model lm\_head gating.
* `kwargs`: Additional arguments

**Returns:** `Union[dict, Tuple, CausalLMOutputWithPast]`

CausalLMOutputWithPast with loss and logits
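
The integer form of `logits_to_keep` can be pictured as a tail slice over sequence positions. The sketch below mirrors the slicing idiom common in Hugging Face causal-LM heads; whether this module uses the same expression is an assumption, and the tensor form of `logits_to_keep` is not covered. `keep_last_positions` is a hypothetical helper.

```python
def keep_last_positions(hidden_states, logits_to_keep=0):
    """Select the sequence positions that would be passed to lm_head."""
    # slice(-0, None) equals slice(0, None), so 0 keeps every position.
    return hidden_states[slice(-logits_to_keep, None)]


seq = ["h0", "h1", "h2", "h3"]
print(keep_last_positions(seq))     # ['h0', 'h1', 'h2', 'h3']
print(keep_last_positions(seq, 1))  # ['h3']
```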

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.from_config(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

classmethod

Create model from config.

**Parameters:**

* `config`: NemotronH\_Nano\_Omni\_Reasoning\_V3 config (HF config with trust\_remote\_code)
* `backend`: Backend configuration
* `kwargs`: Additional arguments

**Returns:**

NemotronOmniForConditionalGeneration instance

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args,
    **kwargs
)
```

classmethod

Load pretrained model.

**Parameters:**

* `pretrained_model_name_or_path`: Path or name of pretrained model
* `model_args`: Additional positional arguments
* `kwargs`: Additional keyword arguments

**Returns:**

NemotronOmniForConditionalGeneration instance

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.get_input_embeddings()
```

Return the input embeddings from the language model.

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.get_output_embeddings()
```

Return the output embeddings (lm\_head) from the language model.

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None
```

Initialize model weights.

**Parameters:**

* `buffer_device`: Device to use for buffer initialization
* `dtype`: Target dtype for model weights

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.pixel_shuffle(
    x: torch.Tensor,
    scale_factor: float = 0.5
) -> torch.Tensor
```

Pixel shuffle for downsampling spatial resolution while increasing channels.

**Parameters:**

* `x`: Input tensor \[N, W, H, C]
* `scale_factor`: Downsampling ratio (default 0.5 = halve spatial dims)

**Returns:** `torch.Tensor`

Shuffled tensor \[N, W\*scale, H\*scale, C/(scale^2)]
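
The output shape rule can be checked with a few lines of arithmetic. This sketch computes shapes only, not the shuffled values, and `pixel_shuffle_shape` is a hypothetical helper. It assumes W and H divide evenly by `1 / scale_factor`.

```python
def pixel_shuffle_shape(shape, scale_factor=0.5):
    """Map an [N, W, H, C] shape to the post-shuffle shape."""
    n, w, h, c = shape
    out_w = int(w * scale_factor)
    out_h = int(h * scale_factor)
    out_c = int(c / (scale_factor ** 2))
    # Pixel shuffle only rearranges elements, so the total count is unchanged.
    assert n * w * h * c == n * out_w * out_h * out_c
    return (n, out_w, out_h, out_c)


print(pixel_shuffle_shape((1, 32, 32, 1280)))  # (1, 16, 16, 5120)
```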

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.prepare_inputs_embeds_for_cp(
    input_ids: torch.Tensor,
    pixel_values: typing.Optional[torch.Tensor] = None,
    image_flags: typing.Optional[torch.Tensor] = None,
    imgs_sizes: typing.Optional[torch.Tensor] = None,
    pixel_values_videos: typing.Optional[torch.Tensor] = None,
    sound_features: typing.Optional[torch.Tensor] = None,
    sound_attention_mask: typing.Optional[torch.Tensor] = None
) -> torch.Tensor
```

Thin wrapper returning just `inputs_embeds` for callers that don't
need the full prepared-inputs dict.

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.prepare_model_inputs_for_cp(
    input_ids: torch.Tensor,
    pixel_values: typing.Optional[torch.Tensor] = None,
    image_flags: typing.Optional[torch.Tensor] = None,
    imgs_sizes: typing.Optional[torch.Tensor] = None,
    pixel_values_videos: typing.Optional[torch.Tensor] = None,
    sound_features: typing.Optional[torch.Tensor] = None,
    sound_attention_mask: typing.Optional[torch.Tensor] = None
) -> dict
```

Merge image/video/audio features into text embeddings BEFORE CP sharding.

Under CP > 1 the sequence is sharded; multimodal scatter must run on the
full un-sharded sequence so each rank ends up with embeddings that match
its local slice of input\_ids. Returns a dict so future per-layer inputs
can ride alongside `inputs_embeds`.
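
The ordering matters because a rank's local slice may contain only placeholder tokens. The toy example below merges on the full sequence and then shards. The contiguous, equal-length split is an assumption made for illustration; the actual CP layout may differ. `shard_for_cp` is a hypothetical helper.

```python
def shard_for_cp(sequence, cp_size):
    """Split a sequence into cp_size contiguous, equal shards (assumed layout)."""
    if len(sequence) % cp_size != 0:
        raise ValueError("sequence length must be divisible by cp_size")
    shard_len = len(sequence) // cp_size
    return [sequence[r * shard_len:(r + 1) * shard_len] for r in range(cp_size)]


IMAGE_TOKEN = 18
input_ids = [5, 18, 18, 18, 18, 9]
vision_embeds = ["v0", "v1", "v2", "v3"]

# Merge on the full sequence first: each placeholder takes the next vision row.
rows = iter(vision_embeds)
inputs_embeds = [next(rows) if tok == IMAGE_TOKEN else f"t{tok}" for tok in input_ids]

id_shards = shard_for_cp(input_ids, cp_size=2)
embed_shards = shard_for_cp(inputs_embeds, cp_size=2)
print(id_shards)     # [[5, 18, 18], [18, 18, 9]]
print(embed_shards)  # [['t5', 'v0', 'v1'], ['v2', 'v3', 't9']]
```

Rank 1 receives `v2` and `v3`. From its local `input_ids` alone it could not tell that its placeholders correspond to the third and fourth vision rows rather than the first and second.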

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.set_input_embeddings(
    value
)
```

Set the input embeddings of the language model.

```python
nemo_automodel.components.models.nemotron_omni.model.NemotronOmniForConditionalGeneration.set_output_embeddings(
    new_embeddings
)
```

Set the output embeddings (lm\_head) of the language model.

```python
class nemo_automodel.components.models.nemotron_omni.model.RMSNorm(
    hidden_size: int,
    eps: float = 1e-05
)
```

**Bases:** `Module`

Root Mean Square Layer Normalization.

```python
nemo_automodel.components.models.nemotron_omni.model.RMSNorm.forward(
    hidden_states: torch.Tensor
) -> torch.Tensor
```
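
For reference, RMS normalization is commonly defined as scaling each element by the reciprocal root mean square of the vector, then by a learned per-element weight. The plain-Python sketch below shows that standard formulation; details of this class such as dtype handling are not documented here, so treat it as illustrative only.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """y_i = weight_i * x_i / sqrt(mean(x^2) + eps)."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


out = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
# With unit weights and eps=0, the output has a mean square of 1.
print(sum(v * v for v in out) / len(out))
```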

```python
class nemo_automodel.components.models.nemotron_omni.model.SoundProjection(
    sound_hidden_size: int,
    projection_hidden_size: int,
    llm_hidden_size: int,
    bias: bool = False
)
```

**Bases:** `Module`

MLP projector from sound encoder to LLM hidden space.

```python
nemo_automodel.components.models.nemotron_omni.model.SoundProjection.forward(
    x: torch.Tensor
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.nemotron_omni.model.SquaredReLU()
```

**Bases:** `Module`

Squared ReLU activation: ReLU(x)^2.

```python
nemo_automodel.components.models.nemotron_omni.model.SquaredReLU.forward(
    x: torch.Tensor
) -> torch.Tensor
```
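
A scalar sketch of the activation, written in plain Python for illustration (the module itself operates on tensors):

```python
def squared_relu(x: float) -> float:
    """ReLU(x)^2: zero for non-positive inputs, x squared otherwise."""
    return max(x, 0.0) ** 2


print([squared_relu(v) for v in (-2.0, 0.0, 1.5)])  # [0.0, 0.0, 2.25]
```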

```python
class nemo_automodel.components.models.nemotron_omni.model.VisionProjector(
    vit_hidden_size: int,
    projector_hidden_size: int,
    llm_hidden_size: int,
    downsample_ratio: float = 0.5
)
```

**Bases:** `Module`

MLP projector from vision encoder to LLM hidden space.

HF checkpoint structure (`mlp1`):

* `mlp1.0.weight` -> RMSNorm weight, shape (vit\_hidden\_size \* pixel\_shuffle\_factor^2,)
* `mlp1.1.weight` -> Linear1 weight, shape (projector\_hidden\_size, vit\_hidden\_size \* pixel\_shuffle\_factor^2)
* `mlp1.3.weight` -> Linear2 weight, shape (llm\_hidden\_size, projector\_hidden\_size)

A SquaredReLU activation sits between Linear1 and Linear2 (index 2 in the Sequential); it has no weight, so it has no checkpoint entry.
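
The expected checkpoint shapes can be derived from the constructor arguments. In this sketch, `pixel_shuffle_factor` is assumed to equal `1 / downsample_ratio`, and the `llm_hidden_size` of 4096 is a placeholder rather than the model's actual value. `vision_projector_shapes` is a hypothetical helper.

```python
def vision_projector_shapes(vit_hidden_size, projector_hidden_size,
                            llm_hidden_size, downsample_ratio=0.5):
    """Map HF checkpoint keys under mlp1 to their expected weight shapes."""
    pixel_shuffle_factor = int(1 / downsample_ratio)  # assumed relationship
    in_dim = vit_hidden_size * pixel_shuffle_factor ** 2
    return {
        "mlp1.0.weight": (in_dim,),                                  # RMSNorm
        "mlp1.1.weight": (projector_hidden_size, in_dim),            # Linear1
        "mlp1.3.weight": (llm_hidden_size, projector_hidden_size),   # Linear2
    }


shapes = vision_projector_shapes(1280, 20480, llm_hidden_size=4096)
for key, shape in shapes.items():
    print(key, shape)
```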

```python
nemo_automodel.components.models.nemotron_omni.model.VisionProjector.forward(
    x: torch.Tensor
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.nemotron_omni.model._ModelProxy(
    llm: nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM
)
```

Thin proxy so the MoE parallelizer can navigate model.model.moe\_config
and model.model -> get\_text\_module -> .layers without changing the weight
hierarchy.

The parallelizer (parallelizer.py) expects:

* `model.model.moe_config` (for expert-count validation)
* `model.model` -> `get_text_module()` (finds the `language_model` attr) -> `.layers`

Setting `self.model = _ModelProxy(self.language_model)` on the VLM satisfies both:

* `model.model.moe_config` -> `language_model.model.moe_config`
* `get_text_module(model.model)` -> `model.model.language_model` == `language_model.model` (NemotronV3Model) -> `.layers`
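
The attribute paths above can be reproduced with stand-in classes. Everything in this sketch is hypothetical (`FakeInnerModel`, `FakeCausalLM`, `ModelProxySketch`, `FakeVLM`, and the `moe_config` contents); it only illustrates how a proxy can expose the expected paths without owning any weights, and is not the actual `_ModelProxy` code.

```python
class FakeInnerModel:
    """Stand-in for NemotronV3Model: owns the layers and the MoE config."""
    def __init__(self):
        self.layers = ["layer0", "layer1"]
        self.moe_config = {"n_routed_experts": 8}  # hypothetical field


class FakeCausalLM:
    """Stand-in for NemotronHForCausalLM: wraps the inner model as .model."""
    def __init__(self):
        self.model = FakeInnerModel()


class ModelProxySketch:
    def __init__(self, llm):
        # Expose the inner model under the attribute name the parallelizer looks for.
        self.language_model = llm.model

    @property
    def moe_config(self):
        return self.language_model.moe_config


class FakeVLM:
    def __init__(self):
        self.language_model = FakeCausalLM()
        self.model = ModelProxySketch(self.language_model)


vlm = FakeVLM()
print(vlm.model.moe_config is vlm.language_model.model.moe_config)  # True
print(vlm.model.language_model.layers)                              # ['layer0', 'layer1']
```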

```python
nemo_automodel.components.models.nemotron_omni.model.ModelClass = NemotronOmniForConditionalGeneration
```

```python
nemo_automodel.components.models.nemotron_omni.model.logger = logging.getLogger(__name__)
```