bridge.models.qwen_omni.modeling_qwen25_omni.thinker_model#

Module Contents#

Classes#

Qwen25OmniThinkerModel

Qwen2.5 Omni Thinker Model.

API#

class bridge.models.qwen_omni.modeling_qwen25_omni.thinker_model.Qwen25OmniThinkerModel(
language_transformer_config: megatron.bridge.models.qwen_omni.modeling_qwen25_omni.transformer_config.Qwen25OmniTransformerConfig,
language_transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
thinker_transformer_config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
parallel_output: bool = True,
pre_process: bool = True,
post_process: bool = True,
add_encoder: bool = True,
add_decoder: bool = True,
pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None = None,
)#

Bases: megatron.core.transformer.MegatronModule

Qwen2.5 Omni Thinker Model.

Key differences from Qwen3OmniMoeThinkerModel:

  • Uses HF vision encoder (Qwen2_5OmniVisionEncoder) directly, not Megatron-native

  • Uses HF audio encoder (Qwen2_5OmniAudioEncoder) directly

  • No deepstack visual embeddings

  • Vision embeddings inserted only at input level

  • Dense LLM (Qwen2 architecture), not MoE

Initialization

shared_embedding_or_output_weight()#

A convenience method that surfaces the language model's word embeddings, which finalize_model_grads._allreduce_word_embedding_grads requires.
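The idea behind this accessor is embedding/output weight tying: the token-embedding table and the output projection reference the same tensor, so the gradient all-reduce only has to touch one shared weight. A minimal pure-Python sketch of that pattern (TinyLM and its attribute names are illustrative stand-ins, not this module's internals):

```python
# Sketch of weight sharing: the embedding table and the output projection
# are the same object, so one accessor can surface the shared weight.
class TinyLM:
    def __init__(self, vocab_size=16, hidden=8):
        # One table used both to embed tokens and to project logits.
        self.word_embeddings = [[0.0] * hidden for _ in range(vocab_size)]
        self.output_weight = self.word_embeddings  # tied: same object

    def shared_embedding_or_output_weight(self):
        return self.word_embeddings

lm = TinyLM()
assert lm.shared_embedding_or_output_weight() is lm.output_weight
```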

set_input_tensor(input_tensor) → None#

Set the input tensor the model's forward pass should consume; used by pipeline-parallel schedules for stages that receive activations rather than token IDs.

freeze(
freeze_language_model: bool = False,
freeze_vision_model: bool = False,
freeze_audio_model: bool = False,
)#

Freeze model modules.

Parameters:
  • freeze_language_model (bool) – Freeze the language model module.

  • freeze_vision_model (bool) – Freeze the vision model module.

  • freeze_audio_model (bool) – Freeze the audio model module.

get_audio_features(
input_features: torch.FloatTensor,
feature_attention_mask: torch.LongTensor | None = None,
audio_feature_lengths: torch.LongTensor | None = None,
)#
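When audio_feature_lengths is not supplied, per-clip lengths can typically be recovered from feature_attention_mask by counting valid (1) positions along the time axis; this is a common convention in HF audio processing, assumed here rather than taken from this module's source. Plain lists stand in for torch tensors:

```python
# Recover per-clip feature lengths from a padding mask: each row is one
# audio clip, 1 marks a valid frame and 0 marks padding.
def lengths_from_mask(feature_attention_mask):
    return [sum(row) for row in feature_attention_mask]

mask = [
    [1, 1, 1, 1, 0, 0],  # clip 0: 4 valid frames
    [1, 1, 0, 0, 0, 0],  # clip 1: 2 valid frames
]
assert lengths_from_mask(mask) == [4, 2]
```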
forward(
input_ids: torch.Tensor,
input_features=None,
position_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
labels: torch.Tensor | None = None,
loss_mask: torch.Tensor | None = None,
inference_params: megatron.core.InferenceParams | None = None,
packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams | None = None,
extra_block_kwargs: dict | None = None,
pixel_values: torch.Tensor | None = None,
pixel_values_videos: torch.Tensor | None = None,
image_grid_thw: torch.Tensor | None = None,
video_grid_thw: torch.Tensor | None = None,
image_input_mask: torch.Tensor | None = None,
video_input_mask: torch.Tensor | None = None,
feature_attention_mask=None,
audio_feature_lengths=None,
cp_img_num: list[int] | None = None,
images_padded: list[bool] | None = None,
use_audio_in_video=None,
video_second_per_grid=None,
**kwargs,
) → torch.Tensor#
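A sketch of the "vision embeddings inserted only at input level" behavior noted above: positions where image_input_mask is set have their text embedding replaced by the next vision embedding, in order. The helper name is hypothetical and plain lists stand in for tensors; this illustrates the masked-insertion idea, not the module's exact implementation:

```python
# Replace masked positions in the text-embedding sequence with vision
# embeddings, consumed in order.
def merge_vision_embeddings(text_embeds, vision_embeds, image_input_mask):
    merged, vision_iter = [], iter(vision_embeds)
    for emb, is_image in zip(text_embeds, image_input_mask):
        merged.append(next(vision_iter) if is_image else emb)
    return merged

text = ["t0", "t1", "t2", "t3"]
vision = ["v0", "v1"]
mask = [False, True, True, False]
assert merge_vision_embeddings(text, vision, mask) == ["t0", "v0", "v1", "t3"]
```

The same masked-insertion scheme applies to video_input_mask with video patch embeddings.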