bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr#

Module Contents#

Classes#

Qwen3ASRTextRMSNorm

Qwen3ASRTextAttention

Multi-headed attention from the "Attention Is All You Need" paper.

Qwen3ASRTextMLP

Qwen3ASRThinkerTextDecoderLayer

Qwen3ASRPreTrainedModel

Base pretrained model for Qwen3-ASR.

Qwen3ASRThinkerCausalLMOutputWithPast

rope_deltas (torch.LongTensor of shape (batch_size,), optional) – The rope index difference between sequence length and multimodal rope.

Qwen3ASRPreTrainedModelForConditionalGeneration

Base pretrained model for Qwen3-ASR conditional generation.

Qwen3ASRAudioAttention

Multi-headed attention from the "Attention Is All You Need" paper.

Qwen3ASRAudioEncoderLayer

SinusoidsPositionEmbedding

Qwen3ASRAudioEncoder

Qwen3ASRThinkerTextRotaryEmbedding

Qwen3ASRThinkerTextMLP

Qwen3ASRThinkerTextRMSNorm

Qwen3ASRThinkerTextAttention

Multi-headed attention from the "Attention Is All You Need" paper.

Qwen3ASRThinkerTextModel

Text model component of the Qwen3-ASR thinker.

Qwen3ASRThinkerForConditionalGeneration

Qwen3-ASR thinker model for conditional generation.

Qwen3ASRThinkerTextPreTrainedModel

Base pretrained model for the Qwen3-ASR thinker text component.

Qwen3ASRForConditionalGeneration

Qwen3-ASR model for conditional generation.

Functions#

rotate_half

Rotates half the hidden dims of the input.

repeat_kv

This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim).

eager_attention_forward

apply_rotary_pos_emb

Applies Rotary Position Embedding to the query and key tensors.

_get_feat_extract_output_lengths

Computes the output lengths of the convolutional layers and of the audio encoder.

Data#

API#

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRTextRMSNorm(hidden_size, eps: float = 1e-06)#

Bases: torch.nn.Module

Initialization

Qwen3ASRTextRMSNorm is equivalent to T5LayerNorm

forward(hidden_states: torch.Tensor) torch.Tensor#
extra_repr()#
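The T5LayerNorm equivalence means this normalization scales by the root-mean-square of the hidden states, with no mean subtraction and no bias, computing the variance in float32 before casting back. A minimal sketch of that computation (the class name `RMSNormSketch` is illustrative, not the module's API):

```python
import torch
from torch import nn

class RMSNormSketch(nn.Module):
    """Illustrative T5LayerNorm-style RMSNorm: no mean subtraction, no bias."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        input_dtype = hidden_states.dtype
        # Compute the variance (mean of squares) in float32 for stability.
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```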
bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.rotate_half(x)#

Rotates half the hidden dims of the input.
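The rotation splits the last dimension in half and maps (x1, x2) to (-x2, x1), the building block of rotary position embeddings. A standalone sketch (`rotate_half_sketch` is a hypothetical name):

```python
import torch

def rotate_half_sketch(x: torch.Tensor) -> torch.Tensor:
    # Split the last dim in half and swap the halves, negating the second:
    # (x1, x2) -> (-x2, x1).
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
```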

bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.repeat_kv(hidden_states: torch.Tensor, n_rep: int) torch.Tensor#

This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim).
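The head expansion can be sketched with an expand-and-reshape, which is how grouped-query attention implementations typically express this without an explicit per-repeat copy; the helper name below is illustrative:

```python
import torch

def repeat_kv_sketch(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand KV heads to match the number of attention heads (GQA);
    # equivalent to torch.repeat_interleave(hidden_states, n_rep, dim=1).
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, slen, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)
```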

bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.eager_attention_forward(
module: torch.nn.Module,
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
attention_mask: Optional[torch.Tensor],
scaling: float,
dropout: float = 0.0,
**kwargs: transformers.processing_utils.Unpack[transformers.utils.generic.TransformersKwargs],
)#
bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.apply_rotary_pos_emb(
q,
k,
cos,
sin,
position_ids=None,
unsqueeze_dim=1,
)#

Applies Rotary Position Embedding to the query and key tensors.

Parameters:
  • q (torch.Tensor) – The query tensor.

  • k (torch.Tensor) – The key tensor.

  • cos (torch.Tensor) – The cosine part of the rotary embedding.

  • sin (torch.Tensor) – The sine part of the rotary embedding.

  • position_ids (torch.Tensor, optional) – Deprecated and unused.

  • unsqueeze_dim (int, optional, defaults to 1) – The ‘unsqueeze_dim’ argument specifies the dimension along which to unsqueeze cos[position_ids] and sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.

Returns:

tuple(torch.Tensor) comprising of the query and key tensors rotated using the Rotary Position Embedding.
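Under the shape conventions described above (q and k of shape [batch_size, heads, seq_len, head_dim], cos and sin of shape [batch_size, seq_len, head_dim], unsqueeze_dim=1), the application can be sketched as follows; the function names are hypothetical:

```python
import torch

def _rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_sketch(q, k, cos, sin, unsqueeze_dim=1):
    # Unsqueeze cos/sin so they broadcast over the heads dimension.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (_rotate_half(q) * sin)
    k_embed = (k * cos) + (_rotate_half(k) * sin)
    return q_embed, k_embed
```

With cos all ones and sin all zeros the rotation is the identity, which is a quick sanity check on the broadcasting.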

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRTextAttention(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
layer_idx: int,
)#

Bases: torch.nn.Module

Multi-headed attention from the "Attention Is All You Need" paper.

Initialization

forward(
hidden_states: torch.Tensor,
position_embeddings: tuple[torch.Tensor, torch.Tensor],
attention_mask: Optional[torch.Tensor],
past_key_values: Optional[transformers.cache_utils.Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs],
) tuple[torch.Tensor, Optional[torch.Tensor]]#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRTextMLP(config)#

Bases: torch.nn.Module

Initialization

forward(x)#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextDecoderLayer(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
layer_idx: int,
)#

Bases: transformers.modeling_layers.GradientCheckpointingLayer

Initialization

forward(
hidden_states: torch.Tensor,
position_embeddings: tuple[torch.Tensor, torch.Tensor],
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[transformers.cache_utils.Cache] = None,
use_cache: Optional[bool] = False,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.utils.generic.TransformersKwargs],
) torch.Tensor#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel#

Bases: transformers.modeling_utils.PreTrainedModel

Base pretrained model for Qwen3-ASR.

config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig#

None

base_model_prefix#

'model'

supports_gradient_checkpointing#

True

_skip_keys_device_placement#

'past_key_values'

_supports_flash_attn#

True

_supports_sdpa#

True

_can_compile_fullgraph#

True

_supports_attention_backend#

True

_can_record_outputs#

None

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerCausalLMOutputWithPast#

Bases: transformers.modeling_outputs.MoeCausalLMOutputWithPast

Parameters:

rope_deltas (torch.LongTensor of shape (batch_size, ), optional) – The rope index difference between sequence length and multimodal rope.

rope_deltas: Optional[torch.LongTensor]#

None

bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr._get_feat_extract_output_lengths(input_lengths)#

Computes the output lengths of the convolutional layers and of the audio encoder.
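The exact kernel sizes and strides are fixed by the model's convolutional front end, but helpers like this chain the standard 1D convolution output-length formula once per layer. A generic sketch (the kernel/stride values in the test are hypothetical, not the model's):

```python
def conv_out_length(length: int, kernel_size: int, stride: int) -> int:
    # Standard 1D convolution output-length formula (no padding, dilation=1).
    return (length - kernel_size) // stride + 1
```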

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModelForConditionalGeneration#

Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel

Base pretrained model for Qwen3-ASR conditional generation.

_prepare_4d_causal_attention_mask_with_cache_position(
attention_mask: torch.Tensor,
sequence_length: int,
target_length: int,
dtype: torch.dtype,
device: torch.device,
min_dtype: float,
cache_position: torch.Tensor,
batch_size: int,
)#

Creates a causal 4D mask of shape (batch_size, 1, query_length, key_value_length) from a 2D mask of shape (batch_size, key_value_length), or if the input attention_mask is already 4D, do nothing.

Parameters:
  • attention_mask (torch.Tensor) – A 2D attention mask of shape (batch_size, key_value_length) or a 4D attention mask of shape (batch_size, 1, query_length, key_value_length).

  • sequence_length (int) – The sequence length being processed.

  • target_length (int) – The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.

  • dtype (torch.dtype) – The dtype to use for the 4D attention mask.

  • device (torch.device) – The device to place the 4D attention mask on.

  • min_dtype (float) – The minimum value representable with the dtype dtype.

  • cache_position (torch.Tensor) – Indices depicting the position of the input sequence tokens in the sequence.

  • batch_size (int) – Batch size.
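Setting aside the static-cache bookkeeping (the cache_position offsets needed when query and key lengths differ), the core of such a helper combines a causal upper-triangular block with the 2D padding mask, writing a large negative value at disallowed positions. A simplified sketch assuming q_len == kv_len; the names and the min_value default are illustrative:

```python
import torch

def causal_4d_mask_sketch(attn_mask_2d: torch.Tensor, q_len: int, kv_len: int,
                          min_value: float = -1e9) -> torch.Tensor:
    # Additive mask of shape (batch, 1, q_len, kv_len):
    # 0 where attention is allowed, min_value where it is not.
    batch = attn_mask_2d.shape[0]
    causal = torch.triu(torch.ones(q_len, kv_len, dtype=torch.bool), diagonal=1)
    mask = torch.zeros(batch, 1, q_len, kv_len)
    mask = mask.masked_fill(causal, min_value)
    # Merge the 2D padding mask: padded key positions are also blocked.
    pad = (attn_mask_2d == 0)[:, None, None, :]
    return mask.masked_fill(pad, min_value)
```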

get_chunked_index(
token_indices: torch.Tensor,
tokens_per_chunk: int,
remove_index: int,
) list[tuple[int, int]]#

Splits token index list into chunks based on token value ranges.

Given a list of token indices, returns a list of (start, end) index tuples representing slices of the list where the token values fall within successive ranges of tokens_per_chunk.

For example, if tokens_per_chunk is 1000, the function will create chunks such that:

  • the first chunk contains token values < 1000,

  • the second chunk contains values >= 1000 and < 2000, and so on.

Parameters:
  • token_indices (torch.Tensor of shape (seq_len, )) – A monotonically increasing list of token index values.

  • tokens_per_chunk (int) – Number of tokens per chunk (used as the chunk size threshold).

  • remove_index (int)

Returns:

A list of tuples, each representing the start (inclusive) and end (exclusive) indices of a chunk in token_indices.

Return type:

list[tuple[int, int]]
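The chunking rule above can be sketched in plain Python; this illustrative version omits the remove_index offset handling of the real method:

```python
def get_chunked_index_sketch(token_indices, tokens_per_chunk):
    # Slice a monotonically increasing index list into (start, end) ranges so
    # that chunk i holds values in [i*tokens_per_chunk, (i+1)*tokens_per_chunk).
    chunks = []
    start, current_chunk = 0, 1
    for i, value in enumerate(token_indices):
        if value >= current_chunk * tokens_per_chunk:
            chunks.append((start, i))
            start = i
            current_chunk += 1
    chunks.append((start, len(token_indices)))
    return chunks
```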

get_rope_index(
attention_mask: Optional[torch.Tensor] = None,
) tuple[torch.Tensor, torch.Tensor]#

Calculates the rope index used by the LLM.

Explanation: each embedding sequence contains text embeddings.

Parameters:
  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) – Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) –

    Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • audio_seqlens (torch.LongTensor of shape (num_audios), optional) – The feature length of each audio as seen by the LLM.

Returns:

  • position_ids (torch.LongTensor of shape (3, batch_size, sequence_length))

  • mrope_position_deltas (torch.Tensor of shape (batch_size))

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRAudioAttention(config)#

Bases: torch.nn.Module

Multi-headed attention from the "Attention Is All You Need" paper.

Initialization

forward(
hidden_states: torch.Tensor,
cu_seqlens: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
**kwargs,
) tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]#

Input shape: Batch x Time x Channel

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRAudioEncoderLayer(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRAudioEncoderConfig,
)#

Bases: transformers.modeling_layers.GradientCheckpointingLayer

Initialization

forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
**kwargs,
) torch.Tensor#
Parameters:
  • hidden_states (torch.FloatTensor) – input to the layer of shape (batch, seq_len, embed_dim)

  • attention_mask (torch.FloatTensor) – attention mask of size (batch, 1, tgt_len, src_len) where padding elements are indicated by very large negative values.

  • layer_head_mask (torch.FloatTensor) – mask for attention heads in a given layer of size (encoder_attention_heads,).

  • output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.SinusoidsPositionEmbedding(length, channels, max_timescale=10000)#

Bases: torch.nn.Module

Initialization

forward(seqlen: int)#
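The name suggests Whisper-style fixed sinusoidal positions: the first half of the channels carries sines and the second half cosines, with geometrically spaced timescales up to max_timescale. A hedged functional sketch (the actual module may differ in layout and caching):

```python
import torch

def sinusoids_sketch(length: int, channels: int,
                     max_timescale: float = 10000.0) -> torch.Tensor:
    # Fixed sinusoidal position table of shape (length, channels):
    # columns [0, channels//2) are sines, the rest cosines.
    assert channels % 2 == 0
    log_timescale_increment = torch.log(torch.tensor(max_timescale)) / (channels // 2 - 1)
    inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2))
    scaled_time = torch.arange(length)[:, None].float() * inv_timescales[None, :]
    return torch.cat([scaled_time.sin(), scaled_time.cos()], dim=1)
```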
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRAudioEncoder(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRAudioEncoderConfig,
)#

Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel

config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRAudioEncoderConfig#

None

main_input_name#

'input_features'

_no_split_modules#

['Qwen3ASRAudioEncoderLayer']

_supports_sdpa#

True

_freeze_parameters()#
get_input_embeddings() torch.nn.Module#
set_input_embeddings(value: torch.nn.Module)#
_prepare_attention_mask(
inputs_tensor: torch.Tensor,
cu_seqlens: torch.Tensor,
) torch.Tensor#
forward(input_features, feature_lens=None, aftercnn_lens=None)#

Parameters:
  • feature_lens (torch.LongTensor of shape (batch_size,)) – Mel-feature length of each audio.

  • aftercnn_lens (torch.LongTensor of shape (batch_size,)) – Feature length after the CNN downsampling.

padded_and_mask_function(
tensor_list,
tensor_len,
padding_value=0,
padding_side='right',
)#

Pads a sequence of tensors to their maximum length on the indicated padding_side, then prepares a mask so that pad tokens are not attended to.
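For the default right padding, the pad-and-mask step can be sketched on 1D sequences as follows (the real method also handles the feature layout and padding_side; the names here are illustrative):

```python
import torch

def pad_and_mask_sketch(tensor_list, lengths, padding_value=0.0):
    # Right-pad 1D tensors to the batch max length and build a 0/1 mask
    # marking real (non-pad) positions for attention.
    max_len = int(max(lengths))
    batch = len(tensor_list)
    padded = torch.full((batch, max_len), padding_value)
    mask = torch.zeros(batch, max_len, dtype=torch.long)
    for i, (t, n) in enumerate(zip(tensor_list, lengths)):
        padded[i, :n] = t[:n]
        mask[i, :n] = 1
    return padded, mask
```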

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextRotaryEmbedding(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
device=None,
)#

Bases: torch.nn.Module

Initialization

inv_freq: torch.Tensor#

None

apply_interleaved_mrope(freqs, mrope_section)#

Apply interleaved MRoPE to 3D rotary embeddings. Reorganizes frequency layout from chunked [TTT…HHH…WWW] to interleaved [THTHWHTHW…TT], preserving frequency continuity.

Parameters:
  • freqs – (3, bs, seq_len, head_dim // 2)

  • mrope_section – (3,)

Returns:

(bs, seq_len, head_dim // 2)

Return type:

x_t

forward(x, position_ids)#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextMLP(config, intermediate_size=None)#

Bases: torch.nn.Module

Initialization

forward(x)#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextRMSNorm(hidden_size, eps=1e-06)#

Bases: torch.nn.Module

Initialization

Qwen3ASRThinkerTextRMSNorm is equivalent to T5LayerNorm

forward(hidden_states)#
extra_repr()#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextAttention(config, layer_idx)#

Bases: torch.nn.Module

Multi-headed attention from the "Attention Is All You Need" paper.

Initialization

forward(
hidden_states: torch.Tensor,
position_embeddings: tuple[torch.Tensor, torch.Tensor],
attention_mask: Optional[torch.Tensor],
past_key_values: Optional[transformers.cache_utils.Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs],
) tuple[torch.Tensor, Optional[torch.Tensor]]#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextModel(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
)#

Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel

Text model component of the Qwen3-ASR thinker.

Initialization

config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig#

None

_no_split_modules#

['Qwen3ASRThinkerTextDecoderLayer']

config_class#

None

_can_record_outputs#

None

forward(
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[transformers.cache_utils.Cache] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs],
) Union[tuple, transformers.modeling_outputs.BaseModelOutputWithPast]#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerForConditionalGeneration(config)#

Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModelForConditionalGeneration, transformers.generation.GenerationMixin

Qwen3-ASR thinker model for conditional generation.

Initialization

config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRThinkerConfig#

None

base_model_prefix#

'thinker'

_tied_weights_keys#

['model.embed_tokens.weight', 'lm_head.weight']

_no_split_modules#

['Qwen3ASRAudioEncoderLayer', 'Qwen3ASRThinkerTextDecoderLayer']

_can_record_outputs#

None

get_input_embeddings()#
set_input_embeddings(value)#
get_audio_features(
input_features: torch.FloatTensor,
feature_attention_mask: Optional[torch.LongTensor] = None,
audio_feature_lengths: Optional[torch.LongTensor] = None,
)#

Encodes audio inputs into continuous embeddings that can be forwarded to the language model.

Parameters:
  • input_features (torch.FloatTensor) – The tensors corresponding to the input audios.

  • feature_attention_mask (torch.LongTensor, optional) – Mask to avoid performing attention on padding feature indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.

  • audio_feature_lengths (torch.LongTensor of shape (num_audios), optional) – The feature length of each audio as seen by the LLM.

get_placeholder_mask(
input_ids: torch.LongTensor,
inputs_embeds: torch.FloatTensor,
)#

Obtains multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count is equal to the length of multimodal features. If the lengths are different, an error is raised.

forward(
input_ids=None,
input_features=None,
attention_mask=None,
feature_attention_mask=None,
audio_feature_lengths=None,
position_ids=None,
past_key_values=None,
inputs_embeds=None,
rope_deltas=None,
labels=None,
use_cache=None,
cache_position=None,
**kwargs,
) Union[tuple, bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerCausalLMOutputWithPast]#

Parameters:
  • feature_attention_mask (torch.Tensor of shape (batch_size, feature_sequence_length), optional) – Mask to avoid performing attention on padding feature indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • audio_feature_lengths (torch.LongTensor of shape (num_audios), optional) – The feature length of each audio as seen by the LLM.

  • rope_deltas (torch.LongTensor of shape (batch_size,), optional) – The rope index difference between sequence length and multimodal rope.

  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for tokens with labels in [0, ..., config.vocab_size].

prepare_inputs_for_generation(
input_ids,
past_key_values=None,
attention_mask=None,
inputs_embeds=None,
cache_position=None,
position_ids=None,
use_cache=True,
input_features=None,
feature_attention_mask=None,
**kwargs,
)#
class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextPreTrainedModel#

Bases: transformers.modeling_utils.PreTrainedModel

Base pretrained model for the Qwen3-ASR thinker text component.

config#

None

base_model_prefix#

'model'

supports_gradient_checkpointing#

True

_no_split_modules#

['Qwen3ASRThinkerTextDecoderLayer']

_skip_keys_device_placement#

['past_key_values']

_supports_flash_attn#

True

_supports_sdpa#

True

_supports_flex_attn#

True

_can_compile_fullgraph#

False

_supports_attention_backend#

True

_can_record_outputs#

None

config_class#

None

class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRForConditionalGeneration(
config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
)#

Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel, transformers.generation.GenerationMixin

Qwen3-ASR model for conditional generation.

Initialization

config_class#

None

get_support_languages()#
generate(
input_ids: Optional[torch.Tensor] = None,
max_new_tokens: int = 4096,
eos_token_id: int | list[int] = [151645, 151643],
**kwargs,
)#
bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.__all__#

['Qwen3ASRForConditionalGeneration', 'Qwen3ASRThinkerTextModel', 'Qwen3ASRThinkerForConditionalGener…