`nemo_automodel.components.models.qwen2_5_omni.model`

Qwen2.5-Omni Thinker for ASR / multimodal text generation.

Qwen2.5-Omni is the dense predecessor of Qwen3-Omni-Moe. For NeMo AutoModel we train only the Thinker (audio + image + video + text); the talker and token2wav components are dropped from the loaded checkpoint by :class:Qwen2_5OmniStateDictAdapter.

Compared with :mod:nemo_automodel.components.models.qwen3_omni_moe.model, this module is intentionally minimal:

- inherits HF’s Qwen2_5OmniThinkerForConditionalGeneration directly (the text backbone is a standard dense Qwen2 transformer with MRoPE, so no custom rewrite is needed);
- adds :class:HFCheckpointingMixin for NeMo-compatible save/load;
- attaches :class:Qwen2_5OmniStateDictAdapter for thinker.* prefix handling;
- does NOT inherit MoEFSDPSyncMixin (dense, no experts).
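
The checkpoint filtering described above can be pictured with a plain dictionary. This is a minimal sketch, not the actual Qwen2_5OmniStateDictAdapter: the helper name `keep_thinker_only`, the example keys, and the choice to strip the `thinker.` prefix on load are all assumptions made for illustration.

```python
def keep_thinker_only(state_dict, prefix="thinker."):
    """Hypothetical helper: keep Thinker weights and strip their prefix."""
    kept = {}
    for key, value in state_dict.items():
        # Talker and token2wav entries do not start with the prefix, so they are dropped.
        if key.startswith(prefix):
            kept[key[len(prefix):]] = value
    return kept


# Toy checkpoint with made-up keys; a real checkpoint maps parameter names to tensors.
full_checkpoint = {
    "thinker.model.embed_tokens.weight": "w0",
    "thinker.audio_tower.conv1.weight": "w1",
    "talker.model.embed_tokens.weight": "w2",
    "token2wav.code2wav.weight": "w3",
}
thinker_only = keep_thinker_only(full_checkpoint)
print(sorted(thinker_only))
```

Saving would presumably apply the reverse mapping (re-adding the prefix) so that the result stays loadable as an HF checkpoint, but the source does not spell that out.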

Module Contents

Classes

Qwen2_5OmniThinkerForConditionalGeneration

Qwen2.5-Omni Thinker (audio + image + video + text → text).

Functions

_resolve_thinker_config

Return the thinker sub-config regardless of whether a full Omni or Thinker-only config was passed in.

Data

ModelClass

API

nemo_automodel.components.models.qwen2_5_omni.model._resolve_thinker_config(config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig) → transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig

Return the thinker sub-config regardless of whether a full Omni or Thinker-only config was passed in.
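
The resolution logic can be sketched with small stand-in config classes. The sketch assumes the full Omni config exposes its Thinker sub-config as a `thinker_config` attribute; `OmniConfigStandIn`, `ThinkerConfigStandIn`, and `resolve_thinker_config` are hypothetical names, and the real function may use an `isinstance` check instead.

```python
from dataclasses import dataclass, field


@dataclass
class ThinkerConfigStandIn:
    """Stand-in for a Thinker-only config."""
    hidden_size: int = 8


@dataclass
class OmniConfigStandIn:
    """Stand-in for a full Omni config that nests the Thinker config."""
    thinker_config: ThinkerConfigStandIn = field(default_factory=ThinkerConfigStandIn)


def resolve_thinker_config(config):
    """Return the Thinker sub-config for either config flavor."""
    # A full config carries a nested thinker_config; a Thinker-only config is returned unchanged.
    return getattr(config, "thinker_config", config)


full = OmniConfigStandIn()
thinker = ThinkerConfigStandIn(hidden_size=16)
print(resolve_thinker_config(full) is full.thinker_config)
print(resolve_thinker_config(thinker) is thinker)
```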

class nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration(
    config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs,
)

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration
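
Because the mixin is listed first among the bases, Python's method resolution order consults it before the HF class. The following is a minimal sketch with stand-in classes; none of them are the real NeMo or HF classes, and `save_pretrained` is used only as an illustrative method name.

```python
class HFThinkerStandIn:
    """Stand-in for the HF Thinker base class."""

    def save_pretrained(self, path):
        return f"hf save to {path}"

    def forward(self, **inputs):
        return "hf forward"


class CheckpointingMixinStandIn:
    """Stand-in for a checkpointing mixin that overrides save/load."""

    def save_pretrained(self, path):
        return f"mixin save to {path}"


class ThinkerStandIn(CheckpointingMixinStandIn, HFThinkerStandIn):
    """Mixin first, HF class second, mirroring the order in the Bases line."""


model = ThinkerStandIn()
mro_names = [cls.__name__ for cls in ThinkerStandIn.__mro__]
print(mro_names)
print(model.save_pretrained("out"))
```

Methods the mixin does not define, such as `forward` here, still resolve to the second base.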

Qwen2.5-Omni Thinker (audio + image + video + text → text).

Initialization

classmethod from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs,
)

classmethod from_config(
    config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs,
)

forward(
    input_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    input_features: torch.FloatTensor | None = None,
    feature_attention_mask: torch.LongTensor | None = None,
    audio_feature_lengths: torch.LongTensor | None = None,
    pixel_values: torch.FloatTensor | None = None,
    pixel_values_videos: torch.FloatTensor | None = None,
    image_grid_thw: torch.LongTensor | None = None,
    video_grid_thw: torch.LongTensor | None = None,
    video_second_per_grid: torch.Tensor | None = None,
    use_audio_in_video: bool | None = None,
    position_ids: torch.Tensor | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    labels: torch.LongTensor | None = None,
    **kwargs: Any,
)

Delegate to HF Thinker forward, passing all multimodal inputs through.

The forward signature mirrors HF’s Qwen2_5OmniThinkerForConditionalGeneration.forward; we override only to make the call site uniform with the rest of NeMo AutoModel. Audio is mandatory for ASR; image / video paths are kept enabled so the same class supports the full Thinker modality set.
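
The delegation pattern described here can be sketched with stand-in classes and plain lists in place of tensors. The classes below are hypothetical and carry only a few of the real parameters; the point is that the override names the multimodal inputs explicitly and hands every one of them to the parent unchanged.

```python
class HFThinkerStandIn:
    """Stand-in for the HF Thinker; reports which inputs it received."""

    def forward(self, input_ids=None, input_features=None, pixel_values=None,
                labels=None, **kwargs):
        provided = {
            "input_ids": input_ids,
            "input_features": input_features,
            "pixel_values": pixel_values,
            "labels": labels,
        }
        return sorted(name for name, value in provided.items() if value is not None)


class ThinkerStandIn(HFThinkerStandIn):
    """Override that only forwards its arguments to the parent implementation."""

    def forward(self, input_ids=None, input_features=None, pixel_values=None,
                labels=None, **kwargs):
        return super().forward(
            input_ids=input_ids,
            input_features=input_features,  # audio features, needed for ASR
            pixel_values=pixel_values,      # image path stays available
            labels=labels,
            **kwargs,
        )


# ASR-style call: text tokens, audio features, and labels, but no image input.
asr_inputs = ThinkerStandIn().forward(
    input_ids=[1, 2, 3],
    input_features=[[0.1, 0.2]],
    labels=[1, 2, 3],
)
print(asr_inputs)
```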

nemo_automodel.components.models.qwen2_5_omni.model.ModelClass: None
