nemo_automodel.components.models.qwen3_omni_moe.model

Module Contents

Classes

Name	Description
`Qwen3OmniMoeThinkerForConditionalGeneration`	Qwen3OmniMoe Thinker for Conditional Generation with multimodal support.
`Qwen3OmniMoeThinkerTextModel`	Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers.

Data

ModelClass

API

class nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration(
    config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeThinkerConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)

Bases: HFCheckpointingMixin, HFQwen3OmniMoeThinkerForConditionalGeneration, MoEFSDPSyncMixin

Qwen3OmniMoe Thinker for Conditional Generation with multimodal support.

lm_head

model

num_experts

= text_config.num_experts

num_experts_per_tok

= text_config.num_experts_per_tok

pad_token_id

router_aux_loss_coef

= getattr(text_config, 'router_aux_loss_coef', 0.0)

spatial_merge_size

state_dict_adapter

vocab_size

= text_config.vocab_size

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.forward(
    input_ids: torch.Tensor,
    input_features: torch.FloatTensor | None = None,
    pixel_values: torch.FloatTensor | None = None,
    pixel_values_videos: torch.FloatTensor | None = None,
    image_grid_thw: torch.LongTensor | None = None,
    video_grid_thw: torch.LongTensor | None = None,
    attention_mask: torch.Tensor | None = None,
    feature_attention_mask: torch.LongTensor | None = None,
    audio_feature_lengths: torch.LongTensor | None = None,
    position_ids: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    labels: torch.LongTensor | None = None,
    output_router_logits: bool | None = None,
    use_audio_in_video: bool | None = None,
    video_second_per_grid: torch.Tensor | None = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    output_hidden_states: typing.Optional[bool] = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor | dict | transformers.modeling_outputs.CausalLMOutputWithPast

Forward pass with multimodal fusion.

Parameters:

input_ids

torch.Tensor

Input token IDs

input_features

torch.FloatTensor | None. Defaults to None.

Audio input features

pixel_values

torch.FloatTensor | None. Defaults to None.

Image pixel values

pixel_values_videos

torch.FloatTensor | None. Defaults to None.

Video pixel values

image_grid_thw

torch.LongTensor | None. Defaults to None.

Image grid (temporal, height, width)

video_grid_thw

torch.LongTensor | None. Defaults to None.

Video grid (temporal, height, width)

attention_mask

torch.Tensor | None. Defaults to None.

Attention mask

feature_attention_mask

torch.LongTensor | None. Defaults to None.

Feature attention mask for audio

audio_feature_lengths

torch.LongTensor | None. Defaults to None.

Audio feature lengths

position_ids

torch.Tensor | None. Defaults to None.

Position IDs (3D for MRoPE)

padding_mask

torch.Tensor | None. Defaults to None.

Padding mask

inputs_embeds

torch.FloatTensor | None. Defaults to None.

Optional pre-computed input embeddings

labels

torch.LongTensor | None. Defaults to None.

Labels for loss computation

output_router_logits

bool | None. Defaults to None.

Whether to output router logits

use_audio_in_video

bool | None. Defaults to None.

Whether to use the audio track contained in video inputs

video_second_per_grid

torch.Tensor | None. Defaults to None.

Seconds per grid for videos

logits_to_keep

Union[int, torch.Tensor]. Defaults to 0.

If > 0, only compute logits for the last logits_to_keep token positions (0 = all positions). Enables memory-efficient fused cross-entropy by letting the recipe request a single-position lm_head projection alongside the final hidden states.

output_hidden_states

Optional[bool]. Defaults to None.

When set, the returned output carries the final hidden states (the input to lm_head) so the recipe can run fused linear cross-entropy.

**attn_kwargs

Any. Defaults to {}.

Additional attention arguments

Returns: torch.Tensor | dict | CausalLMOutputWithPast

Logits tensor, a dict with loss/aux_loss if labels are provided, or a CausalLMOutputWithPast.
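
The logits_to_keep behavior described in the parameter list can be illustrated with plain Python lists standing in for tensors. This is a minimal sketch, assuming the selection follows the Hugging Face Transformers convention of slicing with `slice(-k, None)` for an integer k (so 0 keeps every position). `select_positions` is a hypothetical helper, not part of nemo_automodel, and it covers only the integer case, not the tensor-of-indices case.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Return the sequence positions that would be projected by lm_head.

    hidden_states:  list of per-token vectors for one sequence
                    (a stand-in for a tensor; hypothetical helper).
    logits_to_keep: 0 keeps all positions; k > 0 keeps only the last k.
    """
    # slice(-0, None) is the same as slice(0, None), so 0 selects everything.
    keep = slice(-logits_to_keep, None)
    return hidden_states[keep]


sequence = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
all_positions = select_positions(sequence)      # every position is kept
last_position = select_positions(sequence, 1)   # only the final token
```

Keeping only the trailing positions means the vocabulary-sized projection is computed for fewer tokens, which is where the memory saving comes from.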

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.from_config(
    config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeThinkerConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)

classmethod

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path: str,
    model_args = (),
    kwargs = {}
)

classmethod

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.get_input_embeddings()

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.get_output_embeddings()

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.set_input_embeddings(
    value
)

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.set_output_embeddings(
    new_embeddings
)

class nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel(
    config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeTextConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    moe_overrides: dict | None = None
)

Bases: Module

Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers.

embed_tokens

layers

moe_config

= moe_config or MoEConfig(**moe_defaults)

norm

padding_idx

= getattr(config, 'pad_token_id', None)

rotary_emb

= Qwen3OmniMoeThinkerTextRotaryEmbedding(config)

vocab_size

= config.vocab_size

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel._deepstack_process(
    hidden_states,
    visual_pos_masks,
    visual_embeds
)

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel.forward(
    input_ids: torch.Tensor | None = None,
    inputs_embeds: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    visual_pos_masks: torch.Tensor | None = None,
    deepstack_visual_embeds: list[torch.Tensor] | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor

Parameters:

visual_pos_masks

torch.Tensor of shape (batch_size, seqlen), optional

Mask marking the visual positions

deepstack_visual_embeds

list[torch.Tensor], optional

DeepStack visual embeddings, of shape (num_layers, visual_seqlen, embed_dim). The features are extracted from different visual encoder layers and fed into the decoder hidden states, following the DeepStack paper (https://arxiv.org/abs/2406.04334).
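
How one layer's visual embeddings might be fused into the hidden states can be sketched with nested lists in place of tensors. This assumes the fusion is an element-wise addition at the masked positions, as in the Hugging Face Qwen3-VL reference code; the actual `_deepstack_process` in this module may differ. `deepstack_add` is a hypothetical name used only for illustration.

```python
def deepstack_add(hidden_states, visual_pos_masks, visual_embeds):
    """Add one layer's visual embeddings to hidden states at visual positions.

    hidden_states:    [batch][seqlen][embed_dim] nested lists
    visual_pos_masks: [batch][seqlen] booleans, True where a token is visual
    visual_embeds:    [visual_seqlen][embed_dim], one row per True mask entry
    Assumption: fusion is a plain element-wise addition.
    """
    n_visual = sum(1 for row in visual_pos_masks for flag in row if flag)
    if n_visual != len(visual_embeds):
        raise ValueError("mask and visual_embeds disagree on visual token count")

    # Copy so the caller's hidden states are left untouched.
    out = [[list(tok) for tok in seq] for seq in hidden_states]
    next_visual = 0
    # Visit positions in row-major order, consuming one embedding per True entry.
    for b, row in enumerate(visual_pos_masks):
        for s, is_visual in enumerate(row):
            if is_visual:
                emb = visual_embeds[next_visual]
                out[b][s] = [h + v for h, v in zip(out[b][s], emb)]
                next_visual += 1
    return out


hidden = [[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]]
mask = [[False, True, True]]
embeds = [[0.5, 0.5], [10.0, 20.0]]
fused = deepstack_add(hidden, mask, embeds)
```

Text positions pass through unchanged; only positions flagged in the mask receive a contribution from the visual encoder.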

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel.init_weights(
    buffer_device: torch.device | None = None
) -> None

nemo_automodel.components.models.qwen3_omni_moe.model.ModelClass = Qwen3OmniMoeThinkerForConditionalGeneration