bridge.models.qwen_omni.modeling_qwen25_omni.thinker_model#
Module Contents#
Classes#
| Qwen25OmniThinkerModel | Qwen2.5 Omni Thinker Model. |
API#
- class bridge.models.qwen_omni.modeling_qwen25_omni.thinker_model.Qwen25OmniThinkerModel(
- language_transformer_config: megatron.bridge.models.qwen_omni.modeling_qwen25_omni.transformer_config.Qwen25OmniTransformerConfig,
- language_transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
- thinker_transformer_config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
- parallel_output: bool = True,
- pre_process: bool = True,
- post_process: bool = True,
- add_encoder: bool = True,
- add_decoder: bool = True,
- pg_collection: megatron.core.process_groups_config.ProcessGroupCollection | None = None,
)
Bases:
megatron.core.transformer.MegatronModule

Qwen2.5 Omni Thinker Model.
Key differences from Qwen3OmniMoeThinkerModel:

- Uses the HF vision encoder (Qwen2_5OmniVisionEncoder) directly, not a Megatron-native one
- Uses the HF audio encoder (Qwen2_5OmniAudioEncoder) directly
- No deepstack visual embeddings
- Vision embeddings are inserted only at the input level
- Dense LLM (Qwen2 architecture), not MoE
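Because vision embeddings are inserted only at the input level, the merge amounts to scattering vision-encoder outputs into the token-embedding sequence at the positions flagged by image_input_mask (a parameter of forward below). A minimal sketch in plain PyTorch; the masked-scatter mechanism is an assumption about how such a mask is typically applied, not this class's exact code:

```python
import torch

# Toy sizes: batch 1, sequence of 6 tokens, hidden size 4.
hidden = 4
inputs_embeds = torch.zeros(1, 6, hidden)   # language token embeddings
image_embeds = torch.ones(2, hidden)        # 2 vision tokens from the vision encoder
image_input_mask = torch.tensor([[0, 1, 1, 0, 0, 0]], dtype=torch.bool)

# Scatter the vision embeddings into the masked positions (input-level insertion).
merged = inputs_embeds.masked_scatter(
    image_input_mask.unsqueeze(-1).expand_as(inputs_embeds), image_embeds
)
```

The number of True positions in the mask must equal the number of vision tokens, so the scatter fills exactly the image-placeholder slots and leaves text positions untouched.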
Initialization
This is a convenience method to surface the language model’s word embeddings, which is necessary for
finalize_model_grads._allreduce_word_embedding_grads.
- set_input_tensor(input_tensor) → None#
- freeze(
- freeze_language_model: bool = False,
- freeze_vision_model: bool = False,
- freeze_audio_model: bool = False,
)
Freeze model modules.
- Parameters:
freeze_language_model (bool) – Freeze the language model module.
freeze_vision_model (bool) – Freeze the vision model module.
freeze_audio_model (bool) – Freeze the audio model module.
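The freeze flags can be understood as disabling gradient updates on the corresponding submodules. A hypothetical toy reimplementation of the same semantics (the real method operates on this class's language/vision/audio modules; using `requires_grad_(False)` is an assumption about the mechanism):

```python
import torch.nn as nn

class ToyThinker(nn.Module):
    """Stand-in module with the three submodules the freeze flags refer to."""
    def __init__(self):
        super().__init__()
        self.language_model = nn.Linear(4, 4)
        self.vision_model = nn.Linear(4, 4)
        self.audio_model = nn.Linear(4, 4)

def freeze(model, freeze_language_model=False, freeze_vision_model=False,
           freeze_audio_model=False):
    # Turn off gradients for each requested module; frozen parameters
    # are then skipped by the optimizer's weight updates.
    if freeze_language_model:
        model.language_model.requires_grad_(False)
    if freeze_vision_model:
        model.vision_model.requires_grad_(False)
    if freeze_audio_model:
        model.audio_model.requires_grad_(False)

model = ToyThinker()
freeze(model, freeze_vision_model=True, freeze_audio_model=True)
```

A common use of this pattern is fine-tuning only the language model while keeping the pretrained HF vision and audio encoders fixed, as in the call above.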
- get_audio_features(
- input_features: torch.FloatTensor,
- feature_attention_mask: torch.LongTensor | None = None,
- audio_feature_lengths: torch.LongTensor | None = None,
)
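When audio_feature_lengths is not supplied, per-clip lengths can be recovered from feature_attention_mask by counting the non-padding frames. A short sketch of that derivation in plain PyTorch (treating the sum-over-mask fallback as an assumption, not this method's verified code path):

```python
import torch

# Two audio clips padded to 5 mel frames each; 1 marks valid frames.
feature_attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                                       [1, 1, 1, 1, 1]], dtype=torch.long)

# Valid length per clip = number of unmasked frames.
audio_feature_lengths = feature_attention_mask.sum(dim=-1)
```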
- forward(
- input_ids: torch.Tensor,
- input_features=None,
- position_ids: torch.Tensor | None = None,
- attention_mask: torch.Tensor | None = None,
- labels: torch.Tensor | None = None,
- loss_mask: torch.Tensor | None = None,
- inference_params: megatron.core.InferenceParams | None = None,
- packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams | None = None,
- extra_block_kwargs: dict | None = None,
- pixel_values: torch.Tensor | None = None,
- pixel_values_videos: torch.Tensor | None = None,
- image_grid_thw: torch.Tensor | None = None,
- video_grid_thw: torch.Tensor | None = None,
- image_input_mask: torch.Tensor | None = None,
- video_input_mask: torch.Tensor | None = None,
- feature_attention_mask=None,
- audio_feature_lengths=None,
- cp_img_num: list[int] | None = None,
- images_padded: list[bool] | None = None,
- use_audio_in_video=None,
- video_second_per_grid=None,
- **kwargs,
)
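In the Qwen2.5 vision stack, each row of image_grid_thw holds the (temporal, height, width) patch grid of one image, and the number of placeholder positions that image_input_mask must flag is the grid volume divided by the square of the spatial merge size. A pure-Python sketch of that bookkeeping; the merge size of 2 and the formula are assumptions carried over from the HF Qwen2.5 implementation, not read from this class:

```python
spatial_merge_size = 2  # assumed HF default for the Qwen2.5 vision encoder

def num_image_tokens(grid_thw):
    """Placeholder token positions for one image: t * h * w / merge_size**2."""
    t, h, w = grid_thw
    return (t * h * w) // (spatial_merge_size ** 2)

# A single-frame image with a 16x16 patch grid.
tokens = num_image_tokens((1, 16, 16))
```

This count is what ties pixel_values / image_grid_thw to input_ids: the tokenized prompt must contain exactly that many image-placeholder tokens for the input-level embedding insertion to line up.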