# nemo_automodel.components.models.qwen3_omni_moe.model

## Module Contents

### Classes

| Name                                                                                                                                                | Description                                                              |
| --------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| [`Qwen3OmniMoeThinkerForConditionalGeneration`](#nemo_automodel-components-models-qwen3_omni_moe-model-Qwen3OmniMoeThinkerForConditionalGeneration) | Qwen3OmniMoe Thinker for Conditional Generation with multimodal support. |
| [`Qwen3OmniMoeThinkerTextModel`](#nemo_automodel-components-models-qwen3_omni_moe-model-Qwen3OmniMoeThinkerTextModel)                               | Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers.        |

### Data

[`ModelClass`](#nemo_automodel-components-models-qwen3_omni_moe-model-ModelClass)

### API

```python
class nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration(
    config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeThinkerConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)
```

**Bases:** [HFCheckpointingMixin](/nemo-automodel/nemo_automodel/components/models/common/hf_checkpointing_mixin#nemo_automodel-components-models-common-hf_checkpointing_mixin-HFCheckpointingMixin), `HFQwen3OmniMoeThinkerForConditionalGeneration`, [MoEFSDPSyncMixin](/nemo-automodel/nemo_automodel/components/moe/fsdp_mixin#nemo_automodel-components-moe-fsdp_mixin-MoEFSDPSyncMixin)

Qwen3OmniMoe Thinker for Conditional Generation with multimodal support.

```python
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.forward(
    input_ids: torch.Tensor,
    input_features: torch.FloatTensor | None = None,
    pixel_values: torch.FloatTensor | None = None,
    pixel_values_videos: torch.FloatTensor | None = None,
    image_grid_thw: torch.LongTensor | None = None,
    video_grid_thw: torch.LongTensor | None = None,
    attention_mask: torch.Tensor | None = None,
    feature_attention_mask: torch.LongTensor | None = None,
    audio_feature_lengths: torch.LongTensor | None = None,
    position_ids: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    labels: torch.LongTensor | None = None,
    output_router_logits: bool | None = None,
    use_audio_in_video: bool | None = None,
    video_second_per_grid: torch.Tensor | None = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    output_hidden_states: typing.Optional[bool] = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor | dict | transformers.modeling_outputs.CausalLMOutputWithPast
```

Forward pass with multimodal fusion.

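**Parameters:**

| Name                     | Description                                                                                                                                                                                                                                      |
| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `input_ids`              | Input token IDs                                                                                                                                                                                                                                  |
| `input_features`         | Audio input features                                                                                                                                                                                                                             |
| `pixel_values`           | Image pixel values                                                                                                                                                                                                                               |
| `pixel_values_videos`    | Video pixel values                                                                                                                                                                                                                               |
| `image_grid_thw`         | Image grid (temporal, height, width)                                                                                                                                                                                                             |
| `video_grid_thw`         | Video grid (temporal, height, width)                                                                                                                                                                                                             |
| `attention_mask`         | Attention mask                                                                                                                                                                                                                                   |
| `feature_attention_mask` | Feature attention mask for audio                                                                                                                                                                                                                 |
| `audio_feature_lengths`  | Audio feature lengths                                                                                                                                                                                                                            |
| `position_ids`           | Position IDs (3D for MRoPE)                                                                                                                                                                                                                      |
| `padding_mask`           | Padding mask                                                                                                                                                                                                                                     |
| `inputs_embeds`          | Optional pre-computed input embeddings                                                                                                                                                                                                           |
| `labels`                 | Labels for loss computation                                                                                                                                                                                                                      |
| `output_router_logits`   | Whether to output router logits                                                                                                                                                                                                                  |
| `use_audio_in_video`     | Whether audio is in video                                                                                                                                                                                                                        |
| `video_second_per_grid`  | Seconds per grid for videos                                                                                                                                                                                                                      |
| `logits_to_keep`         | If > 0, only compute logits for the last `logits_to_keep` token positions (0 = all positions). Enables memory-efficient fused cross-entropy by letting the recipe request a single-position `lm_head` projection alongside the final hidden states. |
| `output_hidden_states`   | When set, the returned output carries the final hidden states (the input to `lm_head`) so the recipe can run fused linear cross-entropy.                                                                                                         |
| `attn_kwargs`            | Additional attention arguments                                                                                                                                                                                                                   |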

**Returns:** `torch.Tensor | dict | CausalLMOutputWithPast`

A logits tensor, a dict with loss/aux\_loss if labels are provided, or a `CausalLMOutputWithPast`.
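The `logits_to_keep` semantics (0 = all positions, otherwise the last `logits_to_keep` positions) can be illustrated with a small sketch. It assumes the model follows the slicing convention used by Hugging Face Transformers causal language models, where an integer value is turned into `slice(-logits_to_keep, None)` over the sequence dimension. The helper `select_positions` is hypothetical, uses a plain list in place of a hidden-state tensor, and covers only the integer case (the signature also accepts a `torch.Tensor`).

```python
def select_positions(seq_len: int, logits_to_keep: int) -> list[int]:
    """Return the token positions that would receive an lm_head projection.

    Hypothetical helper: a plain list of indices stands in for the sequence
    dimension of the hidden-state tensor.
    """
    positions = list(range(seq_len))
    # slice(-0, None) is the same as slice(0, None), so 0 keeps every position.
    return positions[slice(-logits_to_keep, None)]


all_positions = select_positions(seq_len=6, logits_to_keep=0)
last_only = select_positions(seq_len=6, logits_to_keep=1)
last_two = select_positions(seq_len=6, logits_to_keep=2)

print(all_positions)  # [0, 1, 2, 3, 4, 5]
print(last_only)      # [5]
print(last_two)       # [4, 5]
```

With `logits_to_keep=1`, only one row goes through `lm_head`, which is what keeps the logits tensor small when the loss is computed from the hidden states instead.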

```python
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.from_config(
    config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeThinkerConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)
```

classmethod

```python
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path: str,
    model_args = (),
    kwargs = {}
)
```

classmethod

```python
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.get_input_embeddings()
```

```python
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.get_output_embeddings()
```

```python
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None
```

```python
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.set_input_embeddings(
    value
)
```

```python
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.set_output_embeddings(
    new_embeddings
)
```

```python
class nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel(
    config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeTextConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    moe_overrides: dict | None = None
)
```

**Bases:** `Module`

Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers.
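As a rough illustration of what "3D" position IDs mean for MRoPE, the sketch below builds position IDs for a text-only batch. It assumes the `(3, batch_size, seq_len)` layout used by the Hugging Face Qwen-VL family, with one axis each for temporal, height, and width, and with text tokens sharing the same index on all three axes. Whether this module expects exactly that layout is an assumption, and `text_only_mrope_position_ids` is a hypothetical helper that uses nested lists in place of tensors.

```python
def text_only_mrope_position_ids(batch_size: int, seq_len: int) -> list[list[list[int]]]:
    """Build nested-list position IDs of shape (3, batch_size, seq_len).

    Assumed layout: axis 0 indexes the temporal, height, and width components.
    For text-only input every component is the ordinary 0..seq_len-1 index.
    """
    num_axes = 3  # temporal, height, width
    return [
        [list(range(seq_len)) for _ in range(batch_size)]
        for _ in range(num_axes)
    ]


position_ids = text_only_mrope_position_ids(batch_size=2, seq_len=4)
shape = (len(position_ids), len(position_ids[0]), len(position_ids[0][0]))

print(shape)               # (3, 2, 4)
print(position_ids[0][0])  # [0, 1, 2, 3]
```

For image and video tokens the three components would generally differ, since they encode a token's place in the temporal, height, and width grid rather than a single running index.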

```python
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel._deepstack_process(
    hidden_states,
    visual_pos_masks,
    visual_embeds
)
```

```python
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel.forward(
    input_ids: torch.Tensor | None = None,
    inputs_embeds: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    visual_pos_masks: torch.Tensor | None = None,
    deepstack_visual_embeds: list[torch.Tensor] | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor
```

visual\_pos\_masks (`torch.Tensor` of shape `(batch_size, seqlen)`, *optional*):
The mask of the visual positions.
deepstack\_visual\_embeds (`list[torch.Tensor]`, *optional*):
The deepstack visual embeddings. The shape is (num\_layers, visual\_seqlen, embed\_dim).
The feature is extracted from the different visual encoder layers, and fed to the decoder
hidden states. It's from the paper DeepStack([https://arxiv.org/abs/2406.04334](https://arxiv.org/abs/2406.04334)).
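One way to read the description above is that each decoder layer adds its slice of visual features to the hidden states at the masked positions. The sketch below shows that idea for a single sequence and a single layer. It assumes element-wise addition at visual positions, as in the Hugging Face Qwen3-VL reference implementation; the actual `_deepstack_process` in this module may differ. `deepstack_add` is a hypothetical name, and nested lists stand in for tensors so the snippet runs without PyTorch.

```python
def deepstack_add(hidden_states, visual_pos_mask, visual_embeds):
    """Add visual embeddings to hidden states at the masked positions.

    hidden_states:   seqlen rows of embed_dim floats (one sequence)
    visual_pos_mask: seqlen booleans, True where the token is visual
    visual_embeds:   visual_seqlen rows of embed_dim floats (one layer)
    """
    if sum(visual_pos_mask) != len(visual_embeds):
        raise ValueError("mask must select exactly one position per visual row")

    visual_rows = iter(visual_embeds)
    fused = []
    for row, is_visual in zip(hidden_states, visual_pos_mask):
        if is_visual:
            # Consume the next visual feature row and add it element-wise.
            extra = next(visual_rows)
            fused.append([h + v for h, v in zip(row, extra)])
        else:
            # Text positions pass through unchanged.
            fused.append(list(row))
    return fused


hidden = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
mask = [False, True, True, False]
visual = [[0.5, 0.5], [0.25, 0.25]]

fused = deepstack_add(hidden, mask, visual)
print(fused)  # [[1.0, 1.0], [2.5, 2.5], [3.25, 3.25], [4.0, 4.0]]
```

In a tensor implementation the same effect is typically achieved with boolean-mask indexing over the flattened `(batch_size, seqlen)` positions.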

```python
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel.init_weights(
    buffer_device: torch.device | None = None
) -> None
```

```python
nemo_automodel.components.models.qwen3_omni_moe.model.ModelClass = Qwen3OmniMoeThinkerForConditionalGeneration
```