nemo_automodel.components.models.qwen3_omni_moe.model

Module Contents

Classes

Name | Description
Qwen3OmniMoeThinkerForConditionalGeneration | Qwen3OmniMoe Thinker for Conditional Generation with multimodal support.
Qwen3OmniMoeThinkerTextModel | Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers.

Data

ModelClass

API

class nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration(
config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeThinkerConfig,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs
)

Bases: HFCheckpointingMixin, HFQwen3OmniMoeThinkerForConditionalGeneration, MoEFSDPSyncMixin

Qwen3OmniMoe Thinker for Conditional Generation with multimodal support.

lm_head
model
num_experts
= text_config.num_experts
num_experts_per_tok
= text_config.num_experts_per_tok
pad_token_id
router_aux_loss_coef
= getattr(text_config, 'router_aux_loss_coef', 0.0)
spatial_merge_size
state_dict_adapter
vocab_size
= text_config.vocab_size
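
The `num_experts` and `num_experts_per_tok` attributes describe a sparse MoE layout in which each token is sent to only a few experts. The following is a minimal, library-free sketch of that routing idea, not this module's implementation: the function name `route_token`, the toy logits, and the renormalization of the selected weights are all assumptions made for illustration.

```python
import math


def route_token(router_logits, num_experts_per_tok):
    """Pick the top-k experts for one token and renormalize their weights."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k most probable experts, highest probability first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = top[:num_experts_per_tok]
    # Renormalize so the kept weights sum to 1 (an assumption of this sketch).
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


# Hypothetical toy values: 4 experts, 2 active per token.
selected = route_token([0.1, 2.0, -1.0, 1.5], num_experts_per_tok=2)
print(selected)
```

Here `len(router_logits)` plays the role of `num_experts`; only the selected experts would run for this token.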
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.forward(
input_ids: torch.Tensor,
input_features: torch.FloatTensor | None = None,
pixel_values: torch.FloatTensor | None = None,
pixel_values_videos: torch.FloatTensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
video_grid_thw: torch.LongTensor | None = None,
attention_mask: torch.Tensor | None = None,
feature_attention_mask: torch.LongTensor | None = None,
audio_feature_lengths: torch.LongTensor | None = None,
position_ids: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
inputs_embeds: torch.FloatTensor | None = None,
labels: torch.LongTensor | None = None,
output_router_logits: bool | None = None,
use_audio_in_video: bool | None = None,
video_second_per_grid: torch.Tensor | None = None,
logits_to_keep: typing.Union[int, torch.Tensor] = 0,
output_hidden_states: typing.Optional[bool] = None,
**attn_kwargs: typing.Any
) -> torch.Tensor | dict | transformers.modeling_outputs.CausalLMOutputWithPast

Forward pass with multimodal fusion.

Parameters:

input_ids
torch.Tensor

Input token IDs

input_features
torch.FloatTensor | None. Defaults to None

Audio input features

pixel_values
torch.FloatTensor | None. Defaults to None

Image pixel values

pixel_values_videos
torch.FloatTensor | None. Defaults to None

Video pixel values

image_grid_thw
torch.LongTensor | None. Defaults to None

Image grid (temporal, height, width)

video_grid_thw
torch.LongTensor | None. Defaults to None

Video grid (temporal, height, width)

attention_mask
torch.Tensor | None. Defaults to None

Attention mask

feature_attention_mask
torch.LongTensor | None. Defaults to None

Feature attention mask for audio

audio_feature_lengths
torch.LongTensor | None. Defaults to None

Audio feature lengths

position_ids
torch.Tensor | None. Defaults to None

Position IDs (3D for MRoPE)

padding_mask
torch.Tensor | None. Defaults to None

Padding mask

inputs_embeds
torch.FloatTensor | None. Defaults to None

Optional pre-computed input embeddings

labels
torch.LongTensor | None. Defaults to None

Labels for loss computation

output_router_logits
bool | None. Defaults to None

Whether to output router logits

use_audio_in_video
bool | None. Defaults to None

Whether to use the audio track contained in video inputs

video_second_per_grid
torch.Tensor | None. Defaults to None

Seconds per grid for videos

logits_to_keep
Union[int, torch.Tensor]. Defaults to 0

If > 0, only compute logits for the last logits_to_keep token positions (0 = all positions). Enables memory-efficient fused cross-entropy by letting the recipe request a single-position lm_head projection alongside the final hidden states.

output_hidden_states
Optional[bool]. Defaults to None

When set, the returned output carries the final hidden states (the input to lm_head) so the recipe can run fused linear cross-entropy.

**attn_kwargs
Any. Defaults to {}

Additional attention arguments

Returns: torch.Tensor | dict | CausalLMOutputWithPast

Logits tensor, a dict with loss/aux_loss if labels are provided, or a CausalLMOutputWithPast.
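
The `logits_to_keep` convention (0 means all positions, k > 0 means the last k) can be expressed with a single slice. The sketch below is an illustration on a plain Python list rather than the module's code: `select_positions` is a hypothetical helper, it covers only the integer form of `logits_to_keep`, and the list items stand in for per-position hidden states.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Return the positions whose logits would be computed."""
    # -0 is 0, so slice(-0, None) is slice(0, None) and keeps every position.
    # For k > 0, slice(-k, None) keeps only the last k positions.
    keep = slice(-logits_to_keep, None)
    return hidden_states[keep]


# Placeholder hidden states for a 4-token sequence.
hidden = ["h0", "h1", "h2", "h3"]
all_positions = select_positions(hidden)
last_position = select_positions(hidden, logits_to_keep=1)
last_two = select_positions(hidden, logits_to_keep=2)
```

Projecting only the selected positions through `lm_head` is what saves memory: the logits tensor scales with the number of kept positions times the vocabulary size.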

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.from_config(
config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeThinkerConfig,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs
)
classmethod
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
**kwargs
)
classmethod
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.get_input_embeddings()
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.get_output_embeddings()
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16
) -> None
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.set_input_embeddings(
value
)
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerForConditionalGeneration.set_output_embeddings(
new_embeddings
)
class nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel(
config: transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe.Qwen3OmniMoeTextConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
moe_overrides: dict | None = None
)

Bases: Module

Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers.

embed_tokens
layers
moe_config
= moe_config or MoEConfig(**moe_defaults)
norm
padding_idx
= getattr(config, 'pad_token_id', None)
rotary_emb
= Qwen3OmniMoeThinkerTextRotaryEmbedding(config)
vocab_size
= config.vocab_size
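
The text model uses MRoPE, and the forward pass accepts 3D position IDs. As a rough illustration of what such IDs can look like for text-only input, the sketch below builds nested lists shaped (3, batch_size, seq_len). It assumes the Hugging Face Qwen-VL convention of a leading axis of size 3 (temporal, height, width) whose entries coincide for plain text; the layout this module expects is not stated here, and `text_only_mrope_position_ids` is a hypothetical helper.

```python
def text_only_mrope_position_ids(batch_size, seq_len):
    """Build position IDs shaped (3, batch_size, seq_len) for text-only input."""
    row = list(range(seq_len))
    # One copy per axis (temporal, height, width) and per batch element.
    # For text tokens all three axes carry the same 1D position index.
    return [[list(row) for _ in range(batch_size)] for _ in range(3)]


position_ids = text_only_mrope_position_ids(batch_size=2, seq_len=4)
print(position_ids[0])
```

For image or video tokens the three axes would differ, which is where the grid arguments (`image_grid_thw`, `video_grid_thw`) come in.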
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel._deepstack_process(
hidden_states,
visual_pos_masks,
visual_embeds
)
nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel.forward(
input_ids: torch.Tensor | None = None,
inputs_embeds: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
visual_pos_masks: torch.Tensor | None = None,
deepstack_visual_embeds: list[torch.Tensor] | None = None,
**attn_kwargs: typing.Any
) -> torch.Tensor

Parameters:

visual_pos_masks
torch.Tensor of shape (batch_size, seqlen), optional

The mask of the visual positions.

deepstack_visual_embeds
list[torch.Tensor], optional

The deepstack visual embeddings, with shape (num_layers, visual_seqlen, embed_dim). The features are extracted from different visual encoder layers and fed into the decoder hidden states, following the DeepStack paper (https://arxiv.org/abs/2406.04334).
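
One way to picture how deepstack features reach the decoder is as an addition at the masked positions. The sketch below works on plain lists for a single flattened sequence and assumes the visual features are added residually to the hidden states where the mask is true; `deepstack_add` and the toy values are hypothetical and not taken from `_deepstack_process`.

```python
def deepstack_add(hidden_states, visual_pos_mask, visual_embeds):
    """Add visual features to the hidden states at masked positions."""
    out = [list(vec) for vec in hidden_states]  # copy so the input is untouched
    visual_iter = iter(visual_embeds)
    for pos, is_visual in enumerate(visual_pos_mask):
        if is_visual:
            # Consume visual features in order, one per masked position.
            feature = next(visual_iter)
            out[pos] = [h + v for h, v in zip(out[pos], feature)]
    return out


# Toy sequence of 3 tokens with embed_dim 2; tokens 1 and 2 are visual.
hidden = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
mask = [False, True, True]
visual = [[0.5, 0.5], [0.25, 0.0]]
fused = deepstack_add(hidden, mask, visual)
print(fused)
```

With a list of per-layer embeddings, the same step would be repeated once per decoder layer that has a matching entry.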

nemo_automodel.components.models.qwen3_omni_moe.model.Qwen3OmniMoeThinkerTextModel.init_weights(
buffer_device: torch.device | None = None
) -> None
nemo_automodel.components.models.qwen3_omni_moe.model.ModelClass = Qwen3OmniMoeThinkerForConditionalGeneration