bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr#

Module Contents#

Classes#

Qwen3ASRTextRMSNorm

Qwen3ASRTextAttention

Multi-headed attention from the "Attention Is All You Need" paper.

Qwen3ASRTextMLP

Qwen3ASRThinkerTextDecoderLayer

Qwen3ASRPreTrainedModel

Base pretrained model for Qwen3-ASR.

Qwen3ASRThinkerCausalLMOutputWithPast

rope_deltas (torch.LongTensor of shape (batch_size,), optional) – The rope index difference between sequence length and multimodal rope.

Qwen3ASRPreTrainedModelForConditionalGeneration

Base pretrained model for Qwen3-ASR conditional generation.

Qwen3ASRAudioAttention

Multi-headed attention from the "Attention Is All You Need" paper.

Qwen3ASRAudioEncoderLayer

SinusoidsPositionEmbedding

Qwen3ASRAudioEncoder

Qwen3ASRThinkerTextRotaryEmbedding

Qwen3ASRThinkerTextMLP

Qwen3ASRThinkerTextRMSNorm

Qwen3ASRThinkerTextAttention

Multi-headed attention from the "Attention Is All You Need" paper.

Qwen3ASRThinkerTextModel

Text model component of the Qwen3-ASR thinker.

Qwen3ASRThinkerForConditionalGeneration

Qwen3-ASR thinker model for conditional generation.

Qwen3ASRThinkerTextPreTrainedModel

Base pretrained model for the Qwen3-ASR thinker text component.

Qwen3ASRForConditionalGeneration

Qwen3-ASR model for conditional generation.

Functions#

rotate_half

Rotates half the hidden dims of the input.

repeat_kv

This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim).

eager_attention_forward

apply_rotary_pos_emb

Applies Rotary Position Embedding to the query and key tensors.

_get_feat_extract_output_lengths

Computes the output lengths of the convolutional layers and of the audio encoder.

Data#

API#

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRTextRMSNorm(hidden_size, eps: float = 1e-06)#

Bases: torch.nn.Module

Initialization

Qwen3ASRTextRMSNorm is equivalent to T5LayerNorm

forward(hidden_states: torch.Tensor) torch.Tensor#
extra_repr()#
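The T5LayerNorm equivalence means this normalization scales by the root-mean-square of the hidden states, with no mean subtraction and no bias, computing the variance in float32 before casting back. A minimal sketch of that computation (the class name `RMSNormSketch` is illustrative, not the module's API):

```python
import torch
from torch import nn

class RMSNormSketch(nn.Module):
    """Illustrative T5LayerNorm-style RMSNorm: no mean subtraction, no bias."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        input_dtype = hidden_states.dtype
        # Compute the variance (mean of squares) in float32 for stability.
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```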
bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.rotate_half(x)#

Rotates half the hidden dims of the input.
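The rotation splits the last dimension in half and maps (x1, x2) to (-x2, x1), the building block of rotary position embeddings. A standalone sketch (`rotate_half_sketch` is a hypothetical name):

```python
import torch

def rotate_half_sketch(x: torch.Tensor) -> torch.Tensor:
    # Split the last dim in half and swap the halves, negating the second:
    # (x1, x2) -> (-x2, x1).
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
```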

bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.repeat_kv(hidden_states: torch.Tensor, n_rep: int) torch.Tensor#

This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim).
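The head expansion can be sketched with an expand-and-reshape, which is how grouped-query attention implementations typically express this without an explicit per-repeat copy; the helper name below is illustrative:

```python
import torch

def repeat_kv_sketch(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand KV heads to match the number of attention heads (GQA);
    # equivalent to torch.repeat_interleave(hidden_states, n_rep, dim=1).
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, slen, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)
```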

bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.eager_attention_forward(
module: torch.nn.Module,
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
attention_mask: Optional[torch.Tensor],
scaling: float,
dropout: float = 0.0,
**kwargs: transformers.processing_utils.Unpack[transformers.utils.generic.TransformersKwargs],
)#
bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.apply_rotary_pos_emb(
q,
k,
cos,
sin,
position_ids=None,
unsqueeze_dim=1,
)#

Applies Rotary Position Embedding to the query and key tensors.

Parameters:
  • q (torch.Tensor) – The query tensor.

  • k (torch.Tensor) – The key tensor.

  • cos (torch.Tensor) – The cosine part of the rotary embedding.

  • sin (torch.Tensor) – The sine part of the rotary embedding.

  • position_ids (torch.Tensor, optional) – Deprecated and unused.

  • unsqueeze_dim (int, optional, defaults to 1) – The ‘unsqueeze_dim’ argument specifies the dimension along which to unsqueeze cos[position_ids] and sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.

Returns:

tuple(torch.Tensor) comprising of the query and key tensors rotated using the Rotary Position Embedding.
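Under the shape conventions described above (q and k of shape [batch_size, heads, seq_len, head_dim], cos and sin of shape [batch_size, seq_len, head_dim], unsqueeze_dim=1), the application can be sketched as follows; the function names are hypothetical:

```python
import torch

def _rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_sketch(q, k, cos, sin, unsqueeze_dim=1):
    # Unsqueeze cos/sin so they broadcast over the heads dimension.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (_rotate_half(q) * sin)
    k_embed = (k * cos) + (_rotate_half(k) * sin)
    return q_embed, k_embed
```

With cos all ones and sin all zeros the rotation is the identity, which is a quick sanity check on the broadcasting.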

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRTextAttention(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
layer_idx: int,
)#

Bases: torch.nn.Module

Multi-headed attention from the "Attention Is All You Need" paper.

Initialization

forward(
hidden_states: torch.Tensor,
position_embeddings: tuple[torch.Tensor, torch.Tensor],
attention_mask: Optional[torch.Tensor],
past_key_values: Optional[transformers.cache_utils.Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs],
) tuple[torch.Tensor, Optional[torch.Tensor]]#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRTextMLP(config)#

Bases: torch.nn.Module

Initialization

forward(x)#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextDecoderLayer(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
layer_idx: int,
)#

Bases: transformers.modeling_layers.GradientCheckpointingLayer

Initialization

forward(
hidden_states: torch.Tensor,
position_embeddings: tuple[torch.Tensor, torch.Tensor],
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[transformers.cache_utils.Cache] = None,
use_cache: Optional[bool] = False,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.utils.generic.TransformersKwargs],
) torch.Tensor#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel#

Bases: transformers.modeling_utils.PreTrainedModel

Base pretrained model for Qwen3-ASR.

config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig#

None

base_model_prefix#

'model'

supports_gradient_checkpointing#

True

_skip_keys_device_placement#

'past_key_values'

_supports_flash_attn#

True

_supports_sdpa#

True

_can_compile_fullgraph#

True

_supports_attention_backend#

True

_can_record_outputs#

None

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerCausalLMOutputWithPast#

Bases: transformers.modeling_outputs.MoeCausalLMOutputWithPast

Parameters:

rope_deltas (torch.LongTensor of shape (batch_size, ), optional) – The rope index difference between sequence length and multimodal rope.

rope_deltas: Optional[torch.LongTensor]#

None

bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr._get_feat_extract_output_lengths(input_lengths)#

Computes the output lengths of the convolutional layers and of the audio encoder.
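The exact kernel sizes and strides are fixed by the model's convolutional front end, but helpers like this chain the standard 1D convolution output-length formula once per layer. A generic sketch (the kernel/stride values in the test are hypothetical, not the model's):

```python
def conv_out_length(length: int, kernel_size: int, stride: int) -> int:
    # Standard 1D convolution output-length formula (no padding, dilation=1).
    return (length - kernel_size) // stride + 1
```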

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModelForConditionalGeneration#

Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel

Base pretrained model for Qwen3-ASR conditional generation.

_prepare_4d_causal_attention_mask_with_cache_position(
attention_mask: torch.Tensor,
sequence_length: int,
target_length: int,
dtype: torch.dtype,
device: torch.device,
min_dtype: float,
cache_position: torch.Tensor,
batch_size: int,
)#

Creates a causal 4D mask of shape (batch_size, 1, query_length, key_value_length) from a 2D mask of shape (batch_size, key_value_length), or if the input attention_mask is already 4D, do nothing.

Parameters:
  • attention_mask (torch.Tensor) – A 2D attention mask of shape (batch_size, key_value_length) or a 4D attention mask of shape (batch_size, 1, query_length, key_value_length).

  • sequence_length (int) – The sequence length being processed.

  • target_length (int) – The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.

  • dtype (torch.dtype) – The dtype to use for the 4D attention mask.

  • device (torch.device) – The device to place the 4D attention mask on.

  • min_dtype (float) – The minimum value representable with the dtype dtype.

  • cache_position (torch.Tensor) – Indices depicting the position of the input sequence tokens in the sequence.

  • batch_size (int) – Batch size.
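Setting aside the static-cache bookkeeping (the cache_position offsets needed when query and key lengths differ), the core of such a helper combines a causal upper-triangular block with the 2D padding mask, writing a large negative value at disallowed positions. A simplified sketch assuming q_len == kv_len; the names and the min_value default are illustrative:

```python
import torch

def causal_4d_mask_sketch(attn_mask_2d: torch.Tensor, q_len: int, kv_len: int,
                          min_value: float = -1e9) -> torch.Tensor:
    # Additive mask of shape (batch, 1, q_len, kv_len):
    # 0 where attention is allowed, min_value where it is not.
    batch = attn_mask_2d.shape[0]
    causal = torch.triu(torch.ones(q_len, kv_len, dtype=torch.bool), diagonal=1)
    mask = torch.zeros(batch, 1, q_len, kv_len)
    mask = mask.masked_fill(causal, min_value)
    # Merge the 2D padding mask: padded key positions are also blocked.
    pad = (attn_mask_2d == 0)[:, None, None, :]
    return mask.masked_fill(pad, min_value)
```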

get_chunked_index(
token_indices: torch.Tensor,
tokens_per_chunk: int,
remove_index: int,
) list[tuple[int, int]]#

Splits token index list into chunks based on token value ranges.

Given a list of token indices, returns a list of (start, end) index tuples representing slices of the list where the token values fall within successive ranges of tokens_per_chunk.

For example, if tokens_per_chunk is 1000, the function will create chunks such that:

  • the first chunk contains token values < 1000,

  • the second chunk contains values >= 1000 and < 2000, and so on.

Parameters:
  • token_indices (torch.Tensor of shape (seq_len, )) – A monotonically increasing list of token index values.

  • tokens_per_chunk (int) – Number of tokens per chunk (used as the chunk size threshold).

  • remove_index (int)

Returns:

A list of tuples, each representing the start (inclusive) and end (exclusive) indices of a chunk in token_indices.

Return type:

list[tuple[int, int]]
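The chunking rule above can be sketched in plain Python; this illustrative version omits the remove_index offset handling of the real method:

```python
def get_chunked_index_sketch(token_indices, tokens_per_chunk):
    # Slice a monotonically increasing index list into (start, end) ranges so
    # that chunk i holds values in [i*tokens_per_chunk, (i+1)*tokens_per_chunk).
    chunks = []
    start, current_chunk = 0, 1
    for i, value in enumerate(token_indices):
        if value >= current_chunk * tokens_per_chunk:
            chunks.append((start, i))
            start = i
            current_chunk += 1
    chunks.append((start, len(token_indices)))
    return chunks
```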

get_rope_index(
attention_mask: Optional[torch.Tensor] = None,
) tuple[torch.Tensor, torch.Tensor]#

Calculates the rope index used by the LLM.

Explanation: each embedding sequence contains text embeddings.

Parameters:
  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) – Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) –

    Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • audio_seqlens (torch.LongTensor of shape (num_audios), optional) – The feature length of each audio as seen by the LLM.

Returns:

  • position_ids (torch.LongTensor of shape (3, batch_size, sequence_length))

  • mrope_position_deltas (torch.Tensor of shape (batch_size))

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRAudioAttention(config)#

Bases: torch.nn.Module

Multi-headed attention from the "Attention Is All You Need" paper.

Initialization

forward(
hidden_states: torch.Tensor,
cu_seqlens: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
**kwargs,
) tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]#

Input shape: Batch x Time x Channel

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRAudioEncoderLayer(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRAudioEncoderConfig,
)#

Bases: transformers.modeling_layers.GradientCheckpointingLayer

Initialization

forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
**kwargs,
) torch.Tensor#
Parameters:
  • hidden_states (torch.FloatTensor) – input to the layer of shape (batch, seq_len, embed_dim)

  • attention_mask (torch.FloatTensor) – attention mask of size (batch, 1, tgt_len, src_len) where padding elements are indicated by very large negative values.

  • layer_head_mask (torch.FloatTensor) – mask for attention heads in a given layer of size (encoder_attention_heads,).

  • output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.SinusoidsPositionEmbedding(length, channels, max_timescale=10000)#

Bases: torch.nn.Module

Initialization

forward(seqlen: int)#
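The name suggests Whisper-style fixed sinusoidal positions: the first half of the channels carries sines and the second half cosines, with geometrically spaced timescales up to max_timescale. A hedged functional sketch (the actual module may differ in layout and caching):

```python
import torch

def sinusoids_sketch(length: int, channels: int,
                     max_timescale: float = 10000.0) -> torch.Tensor:
    # Fixed sinusoidal position table of shape (length, channels):
    # columns [0, channels//2) are sines, the rest cosines.
    assert channels % 2 == 0
    log_timescale_increment = torch.log(torch.tensor(max_timescale)) / (channels // 2 - 1)
    inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2))
    scaled_time = torch.arange(length)[:, None].float() * inv_timescales[None, :]
    return torch.cat([scaled_time.sin(), scaled_time.cos()], dim=1)
```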
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRAudioEncoder(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRAudioEncoderConfig,
)#

Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel

config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRAudioEncoderConfig#

None

main_input_name#

'input_features'

_no_split_modules#

['Qwen3ASRAudioEncoderLayer']

_supports_sdpa#

True

_freeze_parameters()#
get_input_embeddings() torch.nn.Module#
set_input_embeddings(value: torch.nn.Module)#
_prepare_attention_mask(
inputs_tensor: torch.Tensor,
cu_seqlens: torch.Tensor,
) torch.Tensor#
forward(input_features, feature_lens=None, aftercnn_lens=None)#

Parameters:
  • feature_lens (torch.LongTensor of shape (batch_size,)) – Mel-feature length of each audio.

  • aftercnn_lens (torch.LongTensor of shape (batch_size,)) – Feature length after the CNN downsampling.

padded_and_mask_function(
tensor_list,
tensor_len,
padding_value=0,
padding_side='right',
)#

Pads a sequence of tensors to their maximum length on the indicated padding_side, then prepares a mask so that pad tokens are not attended to.
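For the default right padding, the pad-and-mask step can be sketched on 1D sequences as follows (the real method also handles the feature layout and padding_side; the names here are illustrative):

```python
import torch

def pad_and_mask_sketch(tensor_list, lengths, padding_value=0.0):
    # Right-pad 1D tensors to the batch max length and build a 0/1 mask
    # marking real (non-pad) positions for attention.
    max_len = int(max(lengths))
    batch = len(tensor_list)
    padded = torch.full((batch, max_len), padding_value)
    mask = torch.zeros(batch, max_len, dtype=torch.long)
    for i, (t, n) in enumerate(zip(tensor_list, lengths)):
        padded[i, :n] = t[:n]
        mask[i, :n] = 1
    return padded, mask
```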

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextRotaryEmbedding(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
device=None,
)#

Bases: torch.nn.Module

Initialization

inv_freq: torch.Tensor#

None

apply_interleaved_mrope(freqs, mrope_section)#

Apply interleaved MRoPE to 3D rotary embeddings. Reorganizes frequency layout from chunked [TTT…HHH…WWW] to interleaved [THTHWHTHW…TT], preserving frequency continuity.

Parameters:
  • freqs – (3, bs, seq_len, head_dim // 2)

  • mrope_section – (3,)

Returns:

(bs, seq_len, head_dim // 2)

Return type:

x_t

forward(x, position_ids)#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextMLP(config, intermediate_size=None)#

Bases: torch.nn.Module

Initialization

forward(x)#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextRMSNorm(hidden_size, eps=1e-06)#

Bases: torch.nn.Module

Initialization

Qwen3ASRThinkerTextRMSNorm is equivalent to T5LayerNorm

forward(hidden_states)#
extra_repr()#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextAttention(config, layer_idx)#

Bases: torch.nn.Module

Multi-headed attention from the "Attention Is All You Need" paper.

Initialization

forward(
hidden_states: torch.Tensor,
position_embeddings: tuple[torch.Tensor, torch.Tensor],
attention_mask: Optional[torch.Tensor],
past_key_values: Optional[transformers.cache_utils.Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs],
) tuple[torch.Tensor, Optional[torch.Tensor]]#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextModel(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
)#

Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel

Text model component of the Qwen3-ASR thinker.

Initialization

config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig#

None

_no_split_modules#

['Qwen3ASRThinkerTextDecoderLayer']

config_class#

None

_can_record_outputs#

None

forward(
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[transformers.cache_utils.Cache] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs],
) Union[tuple, transformers.modeling_outputs.BaseModelOutputWithPast]#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerForConditionalGeneration(config)#

Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModelForConditionalGeneration, transformers.generation.GenerationMixin

Qwen3-ASR thinker model for conditional generation.

Initialization

config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRThinkerConfig#

None

base_model_prefix#

'thinker'

_tied_weights_keys#

['model.embed_tokens.weight', 'lm_head.weight']

_no_split_modules#

['Qwen3ASRAudioEncoderLayer', 'Qwen3ASRThinkerTextDecoderLayer']

_can_record_outputs#

None

get_input_embeddings()#
set_input_embeddings(value)#
get_audio_features(
input_features: torch.FloatTensor,
feature_attention_mask: Optional[torch.LongTensor] = None,
audio_feature_lengths: Optional[torch.LongTensor] = None,
)#

Encodes audio inputs into continuous embeddings that can be forwarded to the language model.

Parameters:
  • input_features (torch.FloatTensor) – The tensors corresponding to the input audios.

  • feature_attention_mask (torch.LongTensor, optional) – Mask to avoid performing attention on padding feature indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.

  • audio_feature_lengths (torch.LongTensor of shape (num_audios), optional) – The feature length of each audio as seen by the LLM.

get_placeholder_mask(
input_ids: torch.LongTensor,
inputs_embeds: torch.FloatTensor,
)#

Obtains multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count is equal to the length of multimodal features. If the lengths are different, an error is raised.

forward(
input_ids=None,
input_features=None,
attention_mask=None,
feature_attention_mask=None,
audio_feature_lengths=None,
position_ids=None,
past_key_values=None,
inputs_embeds=None,
rope_deltas=None,
labels=None,
use_cache=None,
cache_position=None,
**kwargs,
) Union[tuple, bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerCausalLMOutputWithPast]#

Parameters:
  • feature_attention_mask (torch.Tensor of shape (batch_size, feature_sequence_length), optional) – Mask to avoid performing attention on padding feature indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • audio_feature_lengths (torch.LongTensor of shape (num_audios), optional) – The feature length of each audio as seen by the LLM.

  • rope_deltas (torch.LongTensor of shape (batch_size,), optional) – The rope index difference between sequence length and multimodal rope.

  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for tokens with labels in [0, ..., config.vocab_size].

prepare_inputs_for_generation(
input_ids,
past_key_values=None,
attention_mask=None,
inputs_embeds=None,
cache_position=None,
position_ids=None,
use_cache=True,
input_features=None,
feature_attention_mask=None,
**kwargs,
)#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextPreTrainedModel#

Bases: transformers.modeling_utils.PreTrainedModel

Base pretrained model for the Qwen3-ASR thinker text component.

config#

None

base_model_prefix#

'model'

supports_gradient_checkpointing#

True

_no_split_modules#

['Qwen3ASRThinkerTextDecoderLayer']

_skip_keys_device_placement#

['past_key_values']

_supports_flash_attn#

True

_supports_sdpa#

True

_supports_flex_attn#

True

_can_compile_fullgraph#

False

_supports_attention_backend#

True

_can_record_outputs#

None

config_class#

None

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRForConditionalGeneration(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
)#

Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel, transformers.generation.GenerationMixin

Qwen3-ASR model for conditional generation.

Initialization

config_class#

None

get_support_languages()#
generate(
input_ids: Optional[torch.Tensor] = None,
max_new_tokens: int = 4096,
eos_token_id: int | list[int] = [151645, 151643],
**kwargs,
)#
bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.__all__#

['Qwen3ASRForConditionalGeneration', 'Qwen3ASRThinkerTextModel', 'Qwen3ASRThinkerForConditionalGener…