bridge.models.qwen_audio.modeling_qwen2_audio#

Qwen2-Audio Model for Megatron.

This module provides the Qwen2AudioModel class that combines:

  • HuggingFace’s audio encoder (audio_tower) for processing mel spectrograms

  • HuggingFace’s multimodal projector for audio-to-language projection

  • Megatron’s language model for text generation

Reference: https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct

Module Contents#

Classes#

Qwen2AudioModel

Qwen2-Audio Model wrapper for Megatron.

API#

class bridge.models.qwen_audio.modeling_qwen2_audio.Qwen2AudioModel(
config: megatron.bridge.models.gpt_provider.GPTModelProvider,
pre_process: bool = True,
post_process: bool = True,
vp_stage: Optional[int] = None,
)#

Bases: megatron.core.transformer.module.MegatronModule

Qwen2-Audio Model wrapper for Megatron.

This class combines HuggingFace’s audio components with Megatron’s language model:

  • Audio tower (HF): Processes mel spectrograms through a Whisper-like encoder

  • Multimodal projector (HF): Projects audio features to language model space

  • Language model (Megatron): Generates text conditioned on audio and text inputs

The audio encoder forward pass uses HuggingFace implementation, while the language model forward pass uses Megatron’s optimized implementation.

Parameters:
  • config (GPTModelProvider) – Model provider containing configuration for language and audio modules.

  • pre_process (bool, optional) – Whether to construct the audio tower and projector. Default: True.

  • post_process (bool, optional) – Whether to apply post-processing. Default: True.

  • vp_stage (Optional[int], optional) – Virtual pipeline parallelism stage index. Default: None.

.. attribute:: pre_process

If True, enables audio and multimodal components.

Type:

bool

.. attribute:: post_process

If True, enables post-processing.

Type:

bool

.. attribute:: vp_stage

Virtual pipeline parallelism stage index.

Type:

Optional[int]

.. attribute:: audio_tower

Audio encoder from HuggingFace (Whisper-like).

Type:

nn.Module

.. attribute:: multi_modal_projector

Projects audio features to language model space.

Type:

nn.Module

.. attribute:: language_model

Megatron language model.

Type:

nn.Module

Forward Inputs:
  • input_ids (torch.LongTensor, optional): Tokenized input ids for the language model.

  • attention_mask (torch.Tensor, optional): Attention mask for the language model.

  • position_ids (torch.LongTensor, optional): Position ids for the language model.

  • inputs_embeds (torch.FloatTensor, optional): Precomputed input embeddings.

  • input_features (torch.Tensor, optional): Mel spectrogram features for audio.

  • feature_attention_mask (torch.Tensor, optional): Attention mask for audio features.

  • labels (torch.Tensor, optional): Target labels for supervised training.

  • runtime_gather_output (bool, optional): If True, gather outputs across pipeline stages.

  • loss_mask (torch.Tensor, optional): Mask for loss computation.

Returns:

Model output (e.g., logits or loss, depending on mode).

Return type:

Tensor

.. note::

  • If pre_process is False, only the language model is constructed.

  • The audio tower and projector are only active if pre_process is True.

  • This class is intended for use within the Megatron-LM framework.
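The gating described in the notes above can be sketched with plain PyTorch stand-in modules (this is an illustration of the `pre_process` pattern, not the real `Qwen2AudioModel` internals; the `nn.Identity`/`nn.Linear` placeholders are assumptions):

```python
import torch.nn as nn


class AudioLMSketch(nn.Module):
    """Illustrative stand-in showing how pre_process gates construction
    of the audio tower and projector on the first pipeline stage."""

    def __init__(self, pre_process: bool = True):
        super().__init__()
        self.pre_process = pre_process
        if pre_process:
            # Only the first pipeline stage owns the audio encoder and projector.
            self.audio_tower = nn.Identity()              # stand-in for the HF Whisper-like encoder
            self.multi_modal_projector = nn.Linear(8, 8)  # stand-in for the HF projector
        # Every stage owns (a slice of) the language model.
        self.language_model = nn.Linear(8, 8)             # stand-in for the Megatron LM


first_stage = AudioLMSketch(pre_process=True)
later_stage = AudioLMSketch(pre_process=False)
```

With `pre_process=False`, only `language_model` exists on the module, matching the note that later pipeline stages construct the language model alone.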

Initialization

set_input_tensor(input_tensor) → None#

Set the input tensor for this model chunk. Megatron’s pipeline schedule calls this on non-first stages to hand activations from the previous stage to this one before forward().
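The hand-off pattern behind set_input_tensor can be sketched with a minimal stand-in class (only the method name mirrors the real API; the rest is a simplified assumption):

```python
import torch


class ChunkSketch:
    """Minimal stand-in: a pipeline stage that is not first consumes the
    tensor handed over by the scheduler instead of computing embeddings."""

    def __init__(self, pre_process: bool):
        self.pre_process = pre_process
        self.input_tensor = None

    def set_input_tensor(self, input_tensor):
        # The pipeline schedule calls this before forward() on non-first stages.
        self.input_tensor = input_tensor

    def forward(self, hidden=None):
        if not self.pre_process:
            hidden = self.input_tensor  # activations received from the previous stage
        return hidden * 2               # stand-in computation


stage2 = ChunkSketch(pre_process=False)
stage2.set_input_tensor(torch.ones(2, 3))
out = stage2.forward()
```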

forward(
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
input_features: Optional[torch.Tensor] = None,
feature_attention_mask: Optional[torch.Tensor] = None,
labels: Optional[torch.Tensor] = None,
runtime_gather_output: Optional[bool] = None,
packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
*,
loss_mask: Optional[torch.Tensor] = None,
) → torch.Tensor#

Forward pass combining HuggingFace audio encoder with Megatron language model.

Parameters:
  • input_ids – Tokenized input ids for the language model.

  • attention_mask – Attention mask for the language model.

  • position_ids – Position ids for the language model.

  • inputs_embeds – Precomputed input embeddings.

  • input_features – Mel spectrogram features for audio input.

  • feature_attention_mask – Attention mask for audio features.

  • labels – Target labels for supervised training.

  • runtime_gather_output – If True, gather outputs across pipeline stages.

  • packed_seq_params – Packed-sequence metadata forwarded to the Megatron language model.

  • loss_mask – Mask for loss computation.

Returns:

Model output containing logits or loss.

Return type:

Tensor
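Qwen2-Audio-style models splice projected audio embeddings into the text embedding sequence at placeholder audio-token positions before the language model runs. A simplified sketch of that merge step (the token id, function name, and toy shapes are assumptions, not the exact implementation):

```python
import torch

AUDIO_TOKEN_ID = 151646  # hypothetical placeholder id; the real id comes from the tokenizer config


def merge_audio_embeddings(input_ids, inputs_embeds, audio_embeds):
    """Scatter projected audio embeddings into the text embedding sequence
    wherever input_ids holds the audio placeholder token (simplified sketch)."""
    mask = input_ids == AUDIO_TOKEN_ID            # (batch, seq) boolean mask
    merged = inputs_embeds.clone()
    merged[mask] = audio_embeds.to(inputs_embeds.dtype)  # one audio row per placeholder
    return merged


# Toy shapes: batch=1, seq=5, hidden=4, with two audio placeholder positions.
ids = torch.tensor([[1, AUDIO_TOKEN_ID, AUDIO_TOKEN_ID, 2, 3]])
text_embeds = torch.zeros(1, 5, 4)
audio_embeds = torch.ones(2, 4)                   # output of the multimodal projector (toy values)
out = merge_audio_embeddings(ids, text_embeds, audio_embeds)
```

The merged embeddings would then be passed to the Megatron language model as `inputs_embeds`, so text and audio positions share one sequence.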

freeze(
freeze_language_model: bool,
freeze_audio_model: bool,
freeze_audio_projection: bool,
)#

Freeze model modules.

Make specific modules non-trainable by setting requires_grad to False.

Parameters:
  • freeze_language_model (bool) – Freeze the language model module.

  • freeze_audio_model (bool) – Freeze the audio model module (audio_tower).

  • freeze_audio_projection (bool) – Freeze the audio projection module (multi_modal_projector).
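The freeze logic reduces to setting requires_grad to False on the selected submodules. A sketch of that pattern on a toy module with the same attribute names (`freeze_modules` and the toy layers are illustrative, not the actual method body):

```python
import torch.nn as nn


def freeze_modules(model, freeze_language_model, freeze_audio_model, freeze_audio_projection):
    """Sketch of the freeze pattern: disable gradients on the chosen
    submodules (names mirror the documented attributes)."""
    targets = []
    if freeze_language_model:
        targets.append(model.language_model)
    if freeze_audio_model:
        targets.append(model.audio_tower)
    if freeze_audio_projection:
        targets.append(model.multi_modal_projector)
    for module in targets:
        for param in module.parameters():
            param.requires_grad = False


# Toy model with the same attribute names as Qwen2AudioModel.
toy = nn.Module()
toy.language_model = nn.Linear(4, 4)
toy.audio_tower = nn.Linear(4, 4)
toy.multi_modal_projector = nn.Linear(4, 4)

# Typical fine-tuning setup: train only the language model.
freeze_modules(toy, freeze_language_model=False, freeze_audio_model=True, freeze_audio_projection=True)
```

Frozen parameters still participate in the forward pass; they simply receive no gradient updates during training.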