nemo_automodel.components.models.qwen3_omni_moe.model#

Module Contents#

Classes#

Qwen3OmniMoeThinkerTextModel

Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers.

Qwen3OmniMoeThinkerForConditionalGeneration

Qwen3OmniMoe Thinker for Conditional Generation with multimodal support.

Data#

API#

class nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel(
config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeTextConfig,
backend: nemo_automodel.components.moe.utils.BackendConfig,
*,
moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
)#

Bases: torch.nn.Module

Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers.

Initialization
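A minimal construction sketch, assuming both the Hugging Face text config and the backend config can be instantiated with defaults (in practice both usually need explicit settings):

```python
# Hedged sketch: build the thinker text model from the signature above.
# The default constructions below are assumptions, not verified defaults.
from transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe import Qwen3OmniMoeTextConfig

from nemo_automodel.components.models.qwen3_omni_moe.model import Qwen3OmniMoeThinkerTextModel
from nemo_automodel.components.moe.utils import BackendConfig

text_config = Qwen3OmniMoeTextConfig()   # assumption: defaults are acceptable for a toy model
backend = BackendConfig()                # assumption: defaults are acceptable

model = Qwen3OmniMoeThinkerTextModel(text_config, backend)
model.init_weights()                     # materialize parameters; see init_weights() below
```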

forward(
input_ids: torch.Tensor | None = None,
*,
inputs_embeds: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
visual_pos_masks: torch.Tensor | None = None,
deepstack_visual_embeds: list[torch.Tensor] | None = None,
**attn_kwargs: Any,
) → torch.Tensor#

Parameters:
  • visual_pos_masks (torch.Tensor of shape (batch_size, seqlen), optional) – Mask marking the visual token positions in the sequence.

  • deepstack_visual_embeds (list[torch.Tensor], optional) – Deepstack visual embeddings of shape (num_layers, visual_seqlen, embed_dim). The features are extracted from different visual encoder layers and added to the decoder hidden states, following DeepStack (https://arxiv.org/abs/2406.04334).
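A hedged call sketch for the text-only path, continuing from the construction sketch above; whether a plain 2-D attention mask is sufficient depends on the configured attention backend:

```python
import torch

# Hedged sketch: text-only forward pass (no visual_pos_masks / deepstack_visual_embeds).
# The (batch, seqlen) attention mask is an assumption; some backends may expect
# different masking arguments via **attn_kwargs.
input_ids = torch.randint(0, text_config.vocab_size, (1, 16))
attention_mask = torch.ones_like(input_ids)

hidden = model(input_ids, attention_mask=attention_mask)  # returns a torch.Tensor per the signature
```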

_deepstack_process(hidden_states, visual_pos_masks, visual_embeds)#
init_weights(buffer_device: torch.device | None = None) → None#
class nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration(
config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeThinkerConfig,
moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
backend: nemo_automodel.components.moe.utils.BackendConfig | None = None,
**kwargs,
)#

Bases: transformers.models.qwen3_omni_moe.modeling_qwen3_omni_moe.Qwen3OmniMoeThinkerForConditionalGeneration, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin

Qwen3OmniMoe Thinker for Conditional Generation with multimodal support.

Initialization

classmethod from_config(
config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeThinkerConfig,
moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
backend: nemo_automodel.components.moe.utils.BackendConfig | None = None,
**kwargs,
)#
classmethod from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
**kwargs,
)#
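A hedged loading sketch; the checkpoint identifier below is a placeholder, not a verified Hugging Face model id:

```python
from nemo_automodel.components.models.qwen3_omni_moe.model import (
    Qwen3OmniMoeThinkerForConditionalGeneration,
)

# Placeholder path/id (assumption): substitute a real Qwen3-Omni MoE thinker checkpoint.
model = Qwen3OmniMoeThinkerForConditionalGeneration.from_pretrained(
    "path/to/qwen3-omni-moe-thinker",
)
```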
get_input_embeddings()#
set_input_embeddings(value)#
forward(
input_ids: torch.Tensor,
input_features: torch.FloatTensor | None = None,
pixel_values: torch.FloatTensor | None = None,
pixel_values_videos: torch.FloatTensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
video_grid_thw: torch.LongTensor | None = None,
attention_mask: torch.Tensor | None = None,
feature_attention_mask: torch.LongTensor | None = None,
audio_feature_lengths: torch.LongTensor | None = None,
position_ids: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
inputs_embeds: torch.FloatTensor | None = None,
labels: torch.LongTensor | None = None,
output_router_logits: bool | None = None,
use_audio_in_video: bool | None = None,
video_second_per_grid: torch.Tensor | None = None,
**attn_kwargs: Any,
) → torch.Tensor | dict#

Forward pass with multimodal fusion.

Parameters:
  • input_ids – Input token IDs

  • input_features – Audio input features

  • pixel_values – Image pixel values

  • pixel_values_videos – Video pixel values

  • image_grid_thw – Image grid (temporal, height, width)

  • video_grid_thw – Video grid (temporal, height, width)

  • attention_mask – Attention mask

  • feature_attention_mask – Feature attention mask for audio

  • audio_feature_lengths – Audio feature lengths

  • position_ids – Position IDs (3D for MRoPE)

  • padding_mask – Padding mask

  • inputs_embeds – Optional pre-computed input embeddings

  • labels – Labels for loss computation

  • output_router_logits – Whether to output router logits

  • use_audio_in_video – Whether to use the audio track embedded in the video input

  • video_second_per_grid – Seconds per grid for videos

  • **attn_kwargs – Additional attention arguments

Returns:

Logits tensor, or a dict containing the loss (and aux_loss) when labels are provided
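A hedged text-only training-step sketch, assuming `model` is the instance loaded above; real multimodal inputs (input_features, pixel_values, and the grid/length tensors) would normally come from the matching Qwen3-Omni processor and are omitted here:

```python
import torch

vocab_size = 1000  # placeholder; use the model's actual vocabulary size
input_ids = torch.randint(0, vocab_size, (1, 32))
labels = input_ids.clone()

out = model(
    input_ids,
    attention_mask=torch.ones_like(input_ids),
    labels=labels,
)

# Per the Returns note above, passing labels yields a dict with the loss
# (and aux_loss when router logits are requested); the exact key names are an assumption.
loss = out["loss"] if isinstance(out, dict) else out
```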

initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16,
) → None#
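A short usage sketch for initialize_weights; the CUDA device is an assumption, and torch.bfloat16 mirrors the default shown in the signature:

```python
import torch

# Hedged sketch: initialize weights on a chosen device/dtype (the device choice is an assumption).
model.initialize_weights(buffer_device=torch.device("cuda"), dtype=torch.bfloat16)
```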
nemo_automodel.components.models.qwen3_omni_moe.model.ModelClass#

None