nemo_automodel.components.models.qwen3_omni_moe.model#

Module Contents#

Classes#

Qwen3OmniMoeThinkerTextModel

Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers.

Qwen3OmniMoeThinkerForConditionalGeneration

Qwen3OmniMoe Thinker for Conditional Generation with multimodal support.

Data#

API#

class nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel(
config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeTextConfig,
backend: nemo_automodel.components.moe.utils.BackendConfig,
*,
moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
)#

Bases: torch.nn.Module

Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers.

Initialization
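A minimal construction sketch, assuming both the Hugging Face text config and the backend config can be instantiated with defaults (in practice both usually need explicit settings):

```python
# Hedged sketch: build the thinker text model from the signature above.
# The default constructions below are assumptions, not verified defaults.
from transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe import Qwen3OmniMoeTextConfig

from nemo_automodel.components.models.qwen3_omni_moe.model import Qwen3OmniMoeThinkerTextModel
from nemo_automodel.components.moe.utils import BackendConfig

text_config = Qwen3OmniMoeTextConfig()   # assumption: defaults are acceptable for a toy model
backend = BackendConfig()                # assumption: defaults are acceptable

model = Qwen3OmniMoeThinkerTextModel(text_config, backend)
model.init_weights()                     # materialize parameters; see init_weights() below
```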

forward(
input_ids: torch.Tensor | None = None,
*,
inputs_embeds: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
visual_pos_masks: torch.Tensor | None = None,
deepstack_visual_embeds: list[torch.Tensor] | None = None,
**attn_kwargs: Any,
) → torch.Tensor#

Parameters:
  • visual_pos_masks (torch.Tensor of shape (batch_size, seqlen), optional) – Mask marking the visual token positions in the sequence.

  • deepstack_visual_embeds (list[torch.Tensor], optional) – Deepstack visual embeddings of shape (num_layers, visual_seqlen, embed_dim). The features are extracted from different visual encoder layers and added to the decoder hidden states, following DeepStack (https://arxiv.org/abs/2406.04334).
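A hedged call sketch for the text-only path, continuing from the construction sketch above; whether a plain 2-D attention mask is sufficient depends on the configured attention backend:

```python
import torch

# Hedged sketch: text-only forward pass (no visual_pos_masks / deepstack_visual_embeds).
# The (batch, seqlen) attention mask is an assumption; some backends may expect
# different masking arguments via **attn_kwargs.
input_ids = torch.randint(0, text_config.vocab_size, (1, 16))
attention_mask = torch.ones_like(input_ids)

hidden = model(input_ids, attention_mask=attention_mask)  # returns a torch.Tensor per the signature
```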

_deepstack_process(hidden_states, visual_pos_masks, visual_embeds)#
init_weights(buffer_device: torch.device | None = None) → None#
class nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration(
config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeThinkerConfig,
moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
backend: nemo_automodel.components.moe.utils.BackendConfig | None = None,
**kwargs,
)#

Bases: transformers.models.qwen3_omni_moe.modeling_qwen3_omni_moe.Qwen3OmniMoeThinkerForConditionalGeneration, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin

Qwen3OmniMoe Thinker for Conditional Generation with multimodal support.

Initialization

classmethod from_config(
config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeThinkerConfig,
moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
backend: nemo_automodel.components.moe.utils.BackendConfig | None = None,
**kwargs,
)#
classmethod from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
**kwargs,
)#
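A hedged loading sketch; the checkpoint identifier below is a placeholder, not a verified Hugging Face model id:

```python
from nemo_automodel.components.models.qwen3_omni_moe.model import (
    Qwen3OmniMoeThinkerForConditionalGeneration,
)

# Placeholder path/id (assumption): substitute a real Qwen3-Omni MoE thinker checkpoint.
model = Qwen3OmniMoeThinkerForConditionalGeneration.from_pretrained(
    "path/to/qwen3-omni-moe-thinker",
)
```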
get_input_embeddings()#
set_input_embeddings(value)#
forward(
input_ids: torch.Tensor,
input_features: torch.FloatTensor | None = None,
pixel_values: torch.FloatTensor | None = None,
pixel_values_videos: torch.FloatTensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
video_grid_thw: torch.LongTensor | None = None,
attention_mask: torch.Tensor | None = None,
feature_attention_mask: torch.LongTensor | None = None,
audio_feature_lengths: torch.LongTensor | None = None,
position_ids: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
inputs_embeds: torch.FloatTensor | None = None,
labels: torch.LongTensor | None = None,
output_router_logits: bool | None = None,
use_audio_in_video: bool | None = None,
video_second_per_grid: torch.Tensor | None = None,
**attn_kwargs: Any,
) → torch.Tensor | dict#

Forward pass with multimodal fusion.

Parameters:
  • input_ids – Input token IDs

  • input_features – Audio input features

  • pixel_values – Image pixel values

  • pixel_values_videos – Video pixel values

  • image_grid_thw – Image grid (temporal, height, width)

  • video_grid_thw – Video grid (temporal, height, width)

  • attention_mask – Attention mask

  • feature_attention_mask – Feature attention mask for audio

  • audio_feature_lengths – Audio feature lengths

  • position_ids – Position IDs (3D for MRoPE)

  • padding_mask – Padding mask

  • inputs_embeds – Optional pre-computed input embeddings

  • labels – Labels for loss computation

  • output_router_logits – Whether to output router logits

  • use_audio_in_video – Whether to use the audio track embedded in the video input

  • video_second_per_grid – Seconds per grid for videos

  • **attn_kwargs – Additional attention arguments

Returns:

Logits tensor, or a dict containing the loss (and aux_loss) when labels are provided
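A hedged text-only training-step sketch, assuming `model` is the instance loaded above; real multimodal inputs (input_features, pixel_values, and the grid/length tensors) would normally come from the matching Qwen3-Omni processor and are omitted here:

```python
import torch

vocab_size = 1000  # placeholder; use the model's actual vocabulary size
input_ids = torch.randint(0, vocab_size, (1, 32))
labels = input_ids.clone()

out = model(
    input_ids,
    attention_mask=torch.ones_like(input_ids),
    labels=labels,
)

# Per the Returns note above, passing labels yields a dict with the loss
# (and aux_loss when router logits are requested); the exact key names are an assumption.
loss = out["loss"] if isinstance(out, dict) else out
```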

initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16,
) → None#
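A short usage sketch for initialize_weights; the CUDA device is an assumption, and torch.bfloat16 mirrors the default shown in the signature:

```python
import torch

# Hedged sketch: initialize weights on a chosen device/dtype (the device choice is an assumption).
model.initialize_weights(buffer_device=torch.device("cuda"), dtype=torch.bfloat16)
```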
nemo_automodel.components.models.qwen3_omni_moe.model.ModelClass#

None