nemo_automodel.components.models.qwen2_5_omni.model

Qwen2.5-Omni Thinker for ASR / multimodal text generation.

Qwen2.5-Omni is the dense predecessor of Qwen3-Omni-Moe. For NeMo AutoModel we only train the Thinker (audio + image + video + text); the talker and token2wav components are dropped from the loaded checkpoint by :class:Qwen2_5OmniStateDictAdapter.

Compared with :mod:nemo_automodel.components.models.qwen3_omni_moe.model, this module is intentionally minimal:

- inherits HF’s Qwen2_5OmniThinkerForConditionalGeneration directly (the text backbone is a standard dense Qwen2 transformer with MRoPE, so no custom rewrite is needed);
- adds :class:HFCheckpointingMixin for NeMo-compatible save/load;
- attaches :class:Qwen2_5OmniStateDictAdapter for thinker.* prefix handling;
- does NOT inherit MoEFSDPSyncMixin (dense, no experts).
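The checkpoint handling described above can be pictured with plain dictionaries. This is a minimal sketch, not the actual Qwen2_5OmniStateDictAdapter: the helper name `keep_thinker_only`, the example key names, and the choice to strip the `thinker.` prefix on load are all assumptions made for illustration.

```python
# Hypothetical illustration only; the real adapter may behave differently.
THINKER_PREFIX = "thinker."


def keep_thinker_only(state_dict: dict) -> dict:
    """Drop non-thinker entries and strip the ``thinker.`` prefix (assumed behavior)."""
    return {
        key[len(THINKER_PREFIX):]: value
        for key, value in state_dict.items()
        if key.startswith(THINKER_PREFIX)
    }


# Made-up keys standing in for a full Omni checkpoint.
full_checkpoint = {
    "thinker.model.layers.0.weight": 1,
    "thinker.lm_head.weight": 2,
    "talker.model.layers.0.weight": 3,
    "token2wav.decoder.weight": 4,
}
thinker_state = keep_thinker_only(full_checkpoint)
```

Only the two `thinker.*` entries survive, which is the effect the module description attributes to the adapter: talker and token2wav weights never reach the model.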

Module Contents

Classes

Name	Description
`Qwen2_5OmniThinkerForConditionalGeneration`	Qwen2.5-Omni Thinker (audio + image + video + text → text).

Functions

Name	Description
`_resolve_thinker_config`	Return the thinker sub-config regardless of whether a full Omni or Thinker-only config was passed in.

Data

ModelClass

API

class nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration(
    config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)

Bases: HFCheckpointingMixin, HFQwen2_5OmniThinkerForConditionalGeneration

Qwen2.5-Omni Thinker (audio + image + video + text → text).

backend = backend or BackendConfig()

state_dict_adapter

nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration.forward(
    input_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    input_features: torch.FloatTensor | None = None,
    feature_attention_mask: torch.LongTensor | None = None,
    audio_feature_lengths: torch.LongTensor | None = None,
    pixel_values: torch.FloatTensor | None = None,
    pixel_values_videos: torch.FloatTensor | None = None,
    image_grid_thw: torch.LongTensor | None = None,
    video_grid_thw: torch.LongTensor | None = None,
    video_second_per_grid: torch.Tensor | None = None,
    use_audio_in_video: bool | None = None,
    position_ids: torch.Tensor | None = None,
    past_key_values: typing.Any = None,
    inputs_embeds: torch.FloatTensor | None = None,
    labels: torch.LongTensor | None = None,
    use_cache: bool | None = None,
    rope_deltas: torch.LongTensor | None = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    output_hidden_states: typing.Optional[bool] = None,
    **kwargs: typing.Any
)

Multimodal forward that mirrors HF’s Thinker but supports cut-CE.

This re-implements the body of HF’s Qwen2_5OmniThinkerForConditionalGeneration.forward (same audio/image/video embedding merge and MRoPE index computation) for two reasons: (a) to gate the lm_head projection on logits_to_keep, and (b) to surface the final hidden states (the lm_head input) on the returned :class:~transformers.modeling_outputs.CausalLMOutputWithPast. Together these let the recipe enable :class:FusedLinearCrossEntropy (cut-CE): the recipe checks that logits_to_keep is in the forward signature and that the output carries hidden_states.
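The two conditions named above (a `logits_to_keep` parameter and an output that carries `hidden_states`) can be sketched with the standard library. The helper `supports_cut_ce` and both toy forward functions are hypothetical; this is not the recipe's actual check, only one way such a check could look.

```python
import inspect
from types import SimpleNamespace


def supports_cut_ce(forward_fn, output) -> bool:
    """Hypothetical check: forward exposes logits_to_keep and output has hidden_states."""
    has_param = "logits_to_keep" in inspect.signature(forward_fn).parameters
    has_hidden = getattr(output, "hidden_states", None) is not None
    return has_param and has_hidden


def forward_with_gate(input_ids=None, logits_to_keep=0, **kwargs):
    # Toy stand-in: returns hidden states, as a cut-CE-compatible forward would.
    return SimpleNamespace(logits=None, hidden_states=[[0.0]])


def forward_without_gate(input_ids=None, **kwargs):
    # Toy stand-in: no logits_to_keep parameter and no hidden states on the output.
    return SimpleNamespace(logits=[[0.0]], hidden_states=None)


ok = supports_cut_ce(forward_with_gate, forward_with_gate())
not_ok = supports_cut_ce(forward_without_gate, forward_without_gate())
```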

Audio is mandatory for ASR; image / video paths are kept enabled so the same class supports the full Thinker modality set.

Parameters:

logits_to_keep

Union[int, torch.Tensor]. Defaults to 0.

If 0 (default), lm_head projects all positions; no slice is taken, because DTensor cannot slice a full range. Otherwise the hidden states are restricted to the last logits_to_keep positions before lm_head is applied, so logits are computed only for those positions.

output_hidden_states

Optional[bool]. Defaults to None.

When set, the returned output carries the final hidden states spanning the full sequence.
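The logits_to_keep behavior can be illustrated with nested Python lists in place of tensors. This is a sketch of the integer case only; the function name `select_positions` is hypothetical, and the tensor form of logits_to_keep is not covered here.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Return the positions that would be passed on to lm_head (integer case only)."""
    if logits_to_keep == 0:
        # Default: keep every position and skip slicing entirely.
        return hidden_states
    # Keep only the last `logits_to_keep` positions of each sequence.
    return [seq[-logits_to_keep:] for seq in hidden_states]


# Two sequences of four positions each; integers stand in for hidden vectors.
batch = [[10, 11, 12, 13], [20, 21, 22, 23]]
all_positions = select_positions(batch)
last_two = select_positions(batch, logits_to_keep=2)
```

In the default case the input is returned untouched rather than sliced with a full range, mirroring the "no slice" wording in the parameter description.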

Returns:

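:class:~transformers.modeling_outputs.CausalLMOutputWithPast; when output_hidden_states is set, it carries the final hidden states (the lm_head input) spanning the full sequence.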

nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration.from_config(
    config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)

classmethod

nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)

classmethod

nemo_automodel.components.models.qwen2_5_omni.model._resolve_thinker_config(
    config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig
) -> transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig

Return the thinker sub-config regardless of whether a full Omni or Thinker-only config was passed in.
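A minimal sketch of this kind of resolution, using stand-in dataclasses instead of the real transformers config classes. It assumes the full Omni config exposes the thinker sub-config as a `thinker_config` attribute; the stub classes and the attribute-based detection are illustrative, and the actual function may distinguish the two cases differently (for example with isinstance).

```python
from dataclasses import dataclass, field


@dataclass
class ThinkerConfigStub:
    """Stand-in for a Thinker-only config (hypothetical)."""
    hidden_size: int = 8


@dataclass
class OmniConfigStub:
    """Stand-in for a full Omni config that nests the thinker config (hypothetical)."""
    thinker_config: ThinkerConfigStub = field(default_factory=ThinkerConfigStub)


def resolve_thinker_config(config):
    """Return config.thinker_config when present, else the config itself."""
    return getattr(config, "thinker_config", config)


thinker = ThinkerConfigStub(hidden_size=16)
full = OmniConfigStub(thinker_config=thinker)
```

Either input yields the same thinker config object, so downstream code does not need to know which kind of config it was given.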

nemo_automodel.components.models.qwen2_5_omni.model.ModelClass = Qwen2_5OmniThinkerForConditionalGeneration