nemo_automodel.components.models.qwen2_5_omni.model


Qwen2.5-Omni Thinker for ASR / multimodal text generation.

Qwen2.5-Omni is the dense predecessor of Qwen3-Omni-Moe. NeMo AutoModel trains only the Thinker (audio + image + video + text); the talker and token2wav components are dropped from the loaded checkpoint by :class:Qwen2_5OmniStateDictAdapter.

Compared with :mod:nemo_automodel.components.models.qwen3_omni_moe.model, this module is intentionally minimal:

  • inherits HF’s Qwen2_5OmniThinkerForConditionalGeneration directly (the text backbone is a standard dense Qwen2 transformer with MRoPE, so no custom rewrite is needed);
  • adds :class:HFCheckpointingMixin for NeMo-compatible save/load;
  • attaches :class:Qwen2_5OmniStateDictAdapter for thinker.* prefix handling;
  • does NOT inherit MoEFSDPSyncMixin (dense, no experts).
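The checkpoint handling described above (dropping the talker and token2wav weights and handling the thinker.* prefix) can be pictured with a small sketch. This is not the actual Qwen2_5OmniStateDictAdapter code: the helper name keep_thinker_weights, the toy key names, and the assumption that the prefix is stripped on load are all illustrative.

```python
def keep_thinker_weights(state_dict, prefix="thinker."):
    """Hypothetical helper: keep only thinker.* entries and strip the prefix.

    Assumes the full Omni checkpoint namespaces its components by key prefix;
    the real adapter may handle more cases than this.
    """
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Toy checkpoint with placeholder values; the key names are illustrative only.
checkpoint = {
    "thinker.model.embed_tokens.weight": "w0",
    "thinker.lm_head.weight": "w1",
    "talker.model.embed_tokens.weight": "w2",
    "token2wav.code2wav.weight": "w3",
}

thinker_only = keep_thinker_weights(checkpoint)
print(sorted(thinker_only))
```

Only the two thinker entries survive, with the prefix removed so the keys line up with a Thinker-only module.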

Module Contents

Classes

Name | Description
Qwen2_5OmniThinkerForConditionalGeneration | Qwen2.5-Omni Thinker (audio + image + video + text → text).

Functions

Name | Description
_resolve_thinker_config | Return the thinker sub-config regardless of whether a full Omni or Thinker-only config was passed in.

Data

ModelClass

API

class nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration(
config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs = {}
)

Bases: HFCheckpointingMixin, HFQwen2_5OmniThinkerForConditionalGeneration

Qwen2.5-Omni Thinker (audio + image + video + text → text).

backend
= backend or BackendConfig()
state_dict_adapter
nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration.forward(
input_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
input_features: torch.FloatTensor | None = None,
feature_attention_mask: torch.LongTensor | None = None,
audio_feature_lengths: torch.LongTensor | None = None,
pixel_values: torch.FloatTensor | None = None,
pixel_values_videos: torch.FloatTensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
video_grid_thw: torch.LongTensor | None = None,
video_second_per_grid: torch.Tensor | None = None,
use_audio_in_video: bool | None = None,
position_ids: torch.Tensor | None = None,
past_key_values: typing.Any = None,
inputs_embeds: torch.FloatTensor | None = None,
labels: torch.LongTensor | None = None,
use_cache: bool | None = None,
rope_deltas: torch.LongTensor | None = None,
logits_to_keep: typing.Union[int, torch.Tensor] = 0,
output_hidden_states: typing.Optional[bool] = None,
kwargs: typing.Any = {}
)

Multimodal forward that mirrors HF’s Thinker but supports cut-CE.

This re-implements the body of HF’s Qwen2_5OmniThinkerForConditionalGeneration.forward (same audio/image/video embedding merge and MRoPE index computation) so we can (a) gate the lm_head projection on logits_to_keep and (b) surface the FINAL hidden states (the lm_head input) on the returned :class:~transformers.modeling_outputs.CausalLMOutputWithPast. Together these let the recipe enable :class:FusedLinearCrossEntropy (cut-CE): it checks logits_to_keep is in the signature and that the output carries hidden_states.

Audio is mandatory for ASR; image / video paths are kept enabled so the same class supports the full Thinker modality set.
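The two conditions the recipe checks before enabling cut-CE (logits_to_keep in the forward signature, and hidden_states present on the output) could be tested roughly as follows. This is a sketch under stated assumptions: supports_cut_ce, toy_forward, and legacy_forward are hypothetical stand-ins, and the recipe's real check may differ in detail.

```python
import inspect
from types import SimpleNamespace


def supports_cut_ce(forward_fn, output):
    """Hypothetical check mirroring the two conditions described above."""
    has_param = "logits_to_keep" in inspect.signature(forward_fn).parameters
    has_hidden = getattr(output, "hidden_states", None) is not None
    return has_param and has_hidden


def toy_forward(input_ids=None, logits_to_keep=0, **kwargs):
    # Stand-in for a forward that exposes logits_to_keep and hidden states.
    return SimpleNamespace(logits=None, hidden_states=[[0.0, 1.0]])


def legacy_forward(input_ids=None, **kwargs):
    # Stand-in for a forward that lacks both properties.
    return SimpleNamespace(logits=[[0.0]], hidden_states=None)


eligible = supports_cut_ce(toy_forward, toy_forward())
not_eligible = supports_cut_ce(legacy_forward, legacy_forward())
```

Note that a forward accepting only **kwargs does not pass the signature check, because logits_to_keep is not a named parameter.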

Parameters:

logits_to_keep
Union[int, torch.Tensor] (defaults to 0)

If 0 (default), lm_head projects every position and no slice is applied, because DTensor cannot slice a full range. Otherwise, the hidden states are sliced to the last logits_to_keep positions before lm_head, so logits are computed only for those positions.

output_hidden_states
Optional[bool] (defaults to None)

When set, the returned output carries the final hidden states spanning the full sequence.
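The logits_to_keep gating described in the parameters above can be illustrated with plain Python lists standing in for tensors. This is a minimal sketch covering only the integer case; select_hidden_for_lm_head is a hypothetical name, not a function in this module.

```python
def select_hidden_for_lm_head(hidden_states, logits_to_keep=0):
    """Hypothetical helper: choose which positions are projected by lm_head.

    hidden_states is a list of per-position vectors for a single sequence.
    Only the integer form of logits_to_keep is handled in this sketch.
    """
    if logits_to_keep == 0:
        # Return the input untouched so that no slicing operation is issued.
        return hidden_states
    # Keep only the trailing positions; lm_head would then run on these.
    return hidden_states[-logits_to_keep:]


hidden = [[0.1], [0.2], [0.3], [0.4]]

all_positions = select_hidden_for_lm_head(hidden)
last_two = select_hidden_for_lm_head(hidden, logits_to_keep=2)
```

Skipping the slice entirely for 0, rather than issuing a full-range slice, matches the DTensor constraint stated for the parameter.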

Returns:

:class:~transformers.modeling_outputs.CausalLMOutputWithPast with

nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration.from_config(
config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs = {}
)
classmethod
nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
pretrained_model_name_or_path: str,
model_args = (),
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs = {}
)
classmethod
nemo_automodel.components.models.qwen2_5_omni.model._resolve_thinker_config(
config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig
) -> transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig

Return the thinker sub-config regardless of whether a full Omni or Thinker-only config was passed in.
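One way such a resolver can behave is sketched below with toy config classes. The attribute name thinker_config and the getattr-based lookup are assumptions for illustration; the actual _resolve_thinker_config may use an isinstance check or other logic.

```python
from dataclasses import dataclass, field


@dataclass
class ToyThinkerConfig:
    # Stand-in for a Thinker-only config.
    hidden_size: int = 8


@dataclass
class ToyOmniConfig:
    # Stand-in for a full Omni config that nests the thinker sub-config.
    thinker_config: ToyThinkerConfig = field(default_factory=ToyThinkerConfig)


def resolve_thinker_config(config):
    """Hypothetical resolver: unwrap a full config, pass a thinker config through."""
    return getattr(config, "thinker_config", config)


thinker = ToyThinkerConfig(hidden_size=16)
from_full = resolve_thinker_config(ToyOmniConfig(thinker_config=thinker))
from_thinker = resolve_thinker_config(thinker)
```

Both calls yield the same thinker object, so downstream code can accept either config type.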

nemo_automodel.components.models.qwen2_5_omni.model.ModelClass = Qwen2_5OmniThinkerForConditionalGeneration