nemo_automodel.components.models.qwen3_omni_moe.model#
Module Contents#
Classes#
| Qwen3OmniMoeThinkerTextModel | Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers. |
| Qwen3OmniMoeThinkerForConditionalGeneration | Qwen3OmniMoe Thinker for Conditional Generation with multimodal support. |
Data#
API#
- class nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel(
- config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeTextConfig,
- backend: nemo_automodel.components.moe.utils.BackendConfig,
- *,
- moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
- )#

Bases: torch.nn.Module

Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers.
Initialization
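A minimal construction sketch based only on the signature above; it assumes `Qwen3OmniMoeTextConfig` and `BackendConfig` can be instantiated with default arguments (a real setup would override their fields):

```python
from transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe import Qwen3OmniMoeTextConfig

from nemo_automodel.components.models.qwen3_omni_moe.model import Qwen3OmniMoeThinkerTextModel
from nemo_automodel.components.moe.utils import BackendConfig

config = Qwen3OmniMoeTextConfig()    # default text config (assumption: defaults are valid)
backend = BackendConfig()            # default backend settings (assumption: no required args)

# moe_config is keyword-only and defaults to None
model = Qwen3OmniMoeThinkerTextModel(config, backend)
model.init_weights()                 # initialize parameters; optionally pass buffer_device
```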
- forward(
- input_ids: torch.Tensor | None = None,
- *,
- inputs_embeds: torch.Tensor | None = None,
- position_ids: torch.Tensor | None = None,
- attention_mask: torch.Tensor | None = None,
- padding_mask: torch.Tensor | None = None,
- visual_pos_masks: torch.Tensor | None = None,
- deepstack_visual_embeds: list[torch.Tensor] | None = None,
- **attn_kwargs: Any,
- )#

- Parameters:
visual_pos_masks (torch.Tensor of shape (batch_size, seqlen), optional) – The mask of the visual positions.
deepstack_visual_embeds (list[torch.Tensor], optional) – The deepstack visual embeddings, with overall shape (num_layers, visual_seqlen, embed_dim). The features are extracted from different visual encoder layers and added to the decoder hidden states. It's from the paper DeepStack (https://arxiv.org/abs/2406.04334).
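A hedged usage sketch of `forward` with deepstack inputs, reusing `model` and `config` from the construction sketch above. All shapes and the number of tapped layers are placeholders, and `config.vocab_size` / `config.hidden_size` are assumed to follow the usual Hugging Face config attribute names:

```python
import torch

batch_size, seqlen = 1, 16
num_deepstack_layers = 3             # hypothetical number of tapped visual-encoder layers

input_ids = torch.randint(0, config.vocab_size, (batch_size, seqlen))

# Mark which token positions correspond to visual tokens.
visual_pos_masks = torch.zeros(batch_size, seqlen, dtype=torch.bool)
visual_pos_masks[:, 4:8] = True

# One embedding tensor per tapped layer, each of shape (visual_seqlen, embed_dim).
visual_seqlen = int(visual_pos_masks.sum())
deepstack_visual_embeds = [
    torch.randn(visual_seqlen, config.hidden_size) for _ in range(num_deepstack_layers)
]

outputs = model(
    input_ids,
    visual_pos_masks=visual_pos_masks,
    deepstack_visual_embeds=deepstack_visual_embeds,
)
```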
- _deepstack_process(hidden_states, visual_pos_masks, visual_embeds)#
- init_weights(buffer_device: torch.device | None = None) -> None#
- class nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration(
- config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeThinkerConfig,
- moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
- backend: nemo_automodel.components.moe.utils.BackendConfig | None = None,
- **kwargs,
- )#

Bases: transformers.models.qwen3_omni_moe.modeling_qwen3_omni_moe.Qwen3OmniMoeThinkerForConditionalGeneration, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin

Qwen3OmniMoe Thinker for Conditional Generation with multimodal support.
Initialization
- classmethod from_config(
- config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeThinkerConfig,
- moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
- backend: nemo_automodel.components.moe.utils.BackendConfig | None = None,
- **kwargs,
- )#
- classmethod from_pretrained(
- pretrained_model_name_or_path: str,
- *model_args,
- **kwargs,
- )#
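A hedged loading sketch for the two classmethods above; the checkpoint path is a placeholder, and the commented `from_config` path assumes a default-constructed `Qwen3OmniMoeThinkerConfig` is valid:

```python
from nemo_automodel.components.models.qwen3_omni_moe.model import (
    Qwen3OmniMoeThinkerForConditionalGeneration,
)

# Load weights from a local checkpoint directory or hub id (placeholder path below);
# extra positional/keyword arguments are forwarded.
model = Qwen3OmniMoeThinkerForConditionalGeneration.from_pretrained(
    "/path/to/qwen3-omni-moe-thinker",
)

# Alternatively, build from an explicit config, optionally overriding MoE / backend settings:
# from transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe import Qwen3OmniMoeThinkerConfig
# config = Qwen3OmniMoeThinkerConfig()
# model = Qwen3OmniMoeThinkerForConditionalGeneration.from_config(config)
```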
- get_input_embeddings()#
- set_input_embeddings(value)#
- forward(
- input_ids: torch.Tensor,
- input_features: torch.FloatTensor | None = None,
- pixel_values: torch.FloatTensor | None = None,
- pixel_values_videos: torch.FloatTensor | None = None,
- image_grid_thw: torch.LongTensor | None = None,
- video_grid_thw: torch.LongTensor | None = None,
- attention_mask: torch.Tensor | None = None,
- feature_attention_mask: torch.LongTensor | None = None,
- audio_feature_lengths: torch.LongTensor | None = None,
- position_ids: torch.Tensor | None = None,
- padding_mask: torch.Tensor | None = None,
- inputs_embeds: torch.FloatTensor | None = None,
- labels: torch.LongTensor | None = None,
- output_router_logits: bool | None = None,
- use_audio_in_video: bool | None = None,
- video_second_per_grid: torch.Tensor | None = None,
- **attn_kwargs: Any,
- )#

Forward pass with multimodal fusion.
- Parameters:
input_ids – Input token IDs
input_features – Audio input features
pixel_values – Image pixel values
pixel_values_videos – Video pixel values
image_grid_thw – Image grid (temporal, height, width)
video_grid_thw – Video grid (temporal, height, width)
attention_mask – Attention mask
feature_attention_mask – Feature attention mask for audio
audio_feature_lengths – Audio feature lengths
position_ids – Position IDs (3D for MRoPE)
padding_mask – Padding mask
inputs_embeds – Optional pre-computed input embeddings
labels – Labels for loss computation
output_router_logits – Whether to output router logits
use_audio_in_video – Whether audio is in video
video_second_per_grid – Seconds per grid for videos
**attn_kwargs – Additional attention arguments
- Returns:
Logits tensor or dict with loss/aux_loss if labels provided
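A hedged call sketch using the `model` instance from the loading sketch above. The tensors are placeholders (real inputs come from the tokenizer/processor), and it assumes the returned dict exposes the loss under a "loss" key as described in the Returns note:

```python
import torch

input_ids = torch.randint(0, 1000, (1, 32))   # placeholder token IDs
attention_mask = torch.ones_like(input_ids)
labels = input_ids.clone()

# With labels, a dict containing the loss (and auxiliary MoE loss) is returned.
out = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = out["loss"]

# Without labels, the call returns the logits tensor.
logits = model(input_ids, attention_mask=attention_mask)
```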
- initialize_weights(
- buffer_device: torch.device | None = None,
- dtype: torch.dtype = torch.bfloat16,
- )#
- nemo_automodel.components.models.qwen3_omni_moe.model.ModelClass#
None