bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr#
Module Contents#
Classes#
Qwen3ASRTextAttention – Multi-headed attention from 'Attention Is All You Need' paper
Qwen3ASRPreTrainedModel – Base pretrained model for Qwen3-ASR.
Qwen3ASRPreTrainedModelForConditionalGeneration – Base pretrained model for Qwen3-ASR conditional generation.
Qwen3ASRAudioAttention – Multi-headed attention from 'Attention Is All You Need' paper
Qwen3ASRThinkerTextAttention – Multi-headed attention from 'Attention Is All You Need' paper
Qwen3ASRThinkerTextModel – Text model component of the Qwen3-ASR thinker.
Qwen3ASRThinkerForConditionalGeneration – Qwen3-ASR thinker model for conditional generation.
Qwen3ASRThinkerTextPreTrainedModel – Base pretrained model for the Qwen3-ASR thinker text component.
Qwen3ASRForConditionalGeneration – Qwen3-ASR model for conditional generation.
Functions#
rotate_half – Rotates half the hidden dims of the input.
repeat_kv – The equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep): hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim).
apply_rotary_pos_emb – Applies Rotary Position Embedding to the query and key tensors.
_get_feat_extract_output_lengths – Computes the output length of the convolutional layers and the output length of the audio encoder.
Data#
API#
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRTextRMSNorm(hidden_size, eps: float = 1e-06)#
Bases: torch.nn.Module
Initialization
Qwen3ASRTextRMSNorm is equivalent to T5LayerNorm
- forward(hidden_states: torch.Tensor) torch.Tensor#
- extra_repr()#
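The T5-style RMSNorm referenced above can be sketched as follows; this is a minimal reimplementation for illustration, not the module's exact source:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """T5-style RMSNorm: scale by the reciprocal RMS, with no mean subtraction or bias."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        input_dtype = hidden_states.dtype
        # compute the statistics in float32 for stability
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```

Unlike LayerNorm, only the root-mean-square of the last dimension is normalized, which is why it is described as equivalent to T5LayerNorm.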
- bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.rotate_half(x)#
Rotates half the hidden dims of the input.
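A sketch of the standard rotate_half used across transformers models: the last dimension is split in half and the halves are swapped with a sign flip.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # split the last dimension in half: (x1, x2) -> (-x2, x1)
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

print(rotate_half(torch.tensor([1.0, 2.0, 3.0, 4.0])))
# tensor([-3., -4.,  1.,  2.])
```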
- bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.repeat_kv(hidden_states: torch.Tensor, n_rep: int) torch.Tensor#
This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
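The documented expansion can be sketched with expand + reshape, which matches torch.repeat_interleave on the head dimension without materializing one copy per repeat (a reimplementation of the documented behavior, not necessarily the exact source):

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (batch, num_key_value_heads, seqlen, head_dim)
    # -> (batch, num_key_value_heads * n_rep, seqlen, head_dim)
    batch, num_kv_heads, seqlen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    expanded = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, seqlen, head_dim
    )
    return expanded.reshape(batch, num_kv_heads * n_rep, seqlen, head_dim)
```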
- bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.eager_attention_forward(
- module: torch.nn.Module,
- query: torch.Tensor,
- key: torch.Tensor,
- value: torch.Tensor,
- attention_mask: Optional[torch.Tensor],
- scaling: float,
- dropout: float = 0.0,
- **kwargs: transformers.processing_utils.Unpack[transformers.utils.generic.TransformersKwargs],
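As a hedged sketch (not the function's exact source), the eager attention path is plain scaled dot-product attention in PyTorch; the real function additionally repeats K/V heads for grouped-query attention and reads the dropout setting from the module:

```python
import torch
import torch.nn.functional as F

def eager_attention(query, key, value, attention_mask, scaling, dropout=0.0):
    # query/key/value: (batch, num_heads, seq_len, head_dim)
    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        # slice the mask to the key length; masked slots hold large negative values
        attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]]
    attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = F.dropout(attn_weights, p=dropout, training=False)
    # transpose to (batch, seq_len, num_heads, head_dim) for the output projection
    attn_output = torch.matmul(attn_weights, value).transpose(1, 2).contiguous()
    return attn_output, attn_weights
```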
- bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.apply_rotary_pos_emb(
- q,
- k,
- cos,
- sin,
- position_ids=None,
- unsqueeze_dim=1,
Applies Rotary Position Embedding to the query and key tensors.
- Parameters:
  - q (torch.Tensor) – The query tensor.
  - k (torch.Tensor) – The key tensor.
  - cos (torch.Tensor) – The cosine part of the rotary embedding.
  - sin (torch.Tensor) – The sine part of the rotary embedding.
  - position_ids (torch.Tensor, optional) – Deprecated and unused.
  - unsqueeze_dim (int, optional, defaults to 1) – Specifies the dimension along which to unsqueeze cos[position_ids] and sin[position_ids] so that they can be properly broadcast to the dimensions of q and k. For example, note that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. If q and k have the shape [batch_size, heads, seq_len, head_dim], setting unsqueeze_dim=1 makes cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have the shape [batch_size, seq_len, heads, head_dim], set unsqueeze_dim=2.
- Returns:
  A tuple(torch.Tensor) comprising the query and key tensors rotated using the Rotary Position Embedding.
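A minimal end-to-end sketch of the call; the cos/sin construction below is illustrative, as the model builds these tables in its rotary-embedding module:

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    # unsqueeze_dim=1 broadcasts the (batch, seq_len, head_dim) tables over heads
    cos, sin = cos.unsqueeze(unsqueeze_dim), sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

batch, heads, seq_len, head_dim = 1, 2, 4, 8
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
# illustrative rotary table of shape (batch, seq_len, head_dim)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))
freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)
emb = torch.cat((freqs, freqs), dim=-1)
cos, sin = emb.cos()[None], emb.sin()[None]
q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)
```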
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRTextAttention(
- config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
- layer_idx: int,
Bases: torch.nn.Module
Multi-headed attention from 'Attention Is All You Need' paper
Initialization
- forward(
- hidden_states: torch.Tensor,
- position_embeddings: tuple[torch.Tensor, torch.Tensor],
- attention_mask: Optional[torch.Tensor],
- past_key_values: Optional[transformers.cache_utils.Cache] = None,
- cache_position: Optional[torch.LongTensor] = None,
- **kwargs: transformers.processing_utils.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs],
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRTextMLP(config)#
Bases: torch.nn.Module
Initialization
- forward(x)#
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextDecoderLayer(
- config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
- layer_idx: int,
Bases: transformers.modeling_layers.GradientCheckpointingLayer
Initialization
- forward(
- hidden_states: torch.Tensor,
- position_embeddings: tuple[torch.Tensor, torch.Tensor],
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[transformers.cache_utils.Cache] = None,
- use_cache: Optional[bool] = False,
- cache_position: Optional[torch.LongTensor] = None,
- **kwargs: transformers.processing_utils.Unpack[transformers.utils.generic.TransformersKwargs],
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel#
Bases: transformers.modeling_utils.PreTrainedModel
Base pretrained model for Qwen3-ASR.
- base_model_prefix#
'model'
- supports_gradient_checkpointing#
True
- _skip_keys_device_placement#
'past_key_values'
- _supports_flash_attn#
True
- _supports_sdpa#
True
- _can_compile_fullgraph#
True
- _supports_attention_backend#
True
- _can_record_outputs#
None
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerCausalLMOutputWithPast#
Bases: transformers.modeling_outputs.MoeCausalLMOutputWithPast
- Parameters:
  - rope_deltas (torch.LongTensor of shape (batch_size, ), optional) – The rope index difference between sequence length and multimodal rope.
- rope_deltas: Optional[torch.LongTensor]#
None
- bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr._get_feat_extract_output_lengths(input_lengths)#
Computes the output length of the convolutional layers and the output length of the audio encoder
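The length arithmetic behind such a helper follows the standard Conv1d output-length formula. The concrete kernel sizes, strides, and layer count of Qwen3-ASR's audio front end are not given here, so the values in the example are assumptions for illustration only:

```python
def conv1d_output_length(length: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    # standard Conv1d output-length formula (dilation = 1)
    return (length + 2 * padding - kernel_size) // stride + 1

# e.g. a hypothetical stride-2 layer with kernel 3 and padding 1 halves the length:
print(conv1d_output_length(3000, kernel_size=3, stride=2, padding=1))  # 1500
```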
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModelForConditionalGeneration#
Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel
Base pretrained model for Qwen3-ASR conditional generation.
- _prepare_4d_causal_attention_mask_with_cache_position(
- attention_mask: torch.Tensor,
- sequence_length: int,
- target_length: int,
- dtype: torch.dtype,
- device: torch.device,
- min_dtype: float,
- cache_position: torch.Tensor,
- batch_size: int,
Creates a causal 4D mask of shape (batch_size, 1, query_length, key_value_length) from a 2D mask of shape (batch_size, key_value_length). If the input attention_mask is already 4D, it is returned unchanged.
- Parameters:
  - attention_mask (torch.Tensor) – A 2D attention mask of shape (batch_size, key_value_length) or a 4D attention mask of shape (batch_size, 1, query_length, key_value_length).
  - sequence_length (int) – The sequence length being processed.
  - target_length (int) – The target length: when generating with a static cache, the mask should be as long as the static cache to account for the zero padding (the part of the cache that is not yet filled).
  - dtype (torch.dtype) – The dtype to use for the 4D attention mask.
  - device (torch.device) – The device to place the 4D attention mask on.
  - min_dtype (float) – The minimum value representable with the dtype dtype.
  - cache_position (torch.Tensor) – Indices depicting the position of the input sequence tokens in the sequence.
  - batch_size (int) – The batch size.
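A sketch of the common transformers pattern for this helper, simplified to omit device placement and dtype conversion of the incoming 2D mask (not the exact source):

```python
import torch

def make_4d_causal_mask(attention_mask, sequence_length, target_length,
                        dtype, min_dtype, cache_position, batch_size):
    # start fully masked (min_dtype) and open up the causal lower triangle
    causal = torch.full((sequence_length, target_length), min_dtype, dtype=dtype)
    if sequence_length != 1:
        causal = torch.triu(causal, diagonal=1)
    # additionally mask cache slots beyond the current positions
    causal = causal * (torch.arange(target_length) > cache_position.reshape(-1, 1))
    causal = causal[None, None, :, :].expand(batch_size, 1, -1, -1).clone()
    if attention_mask is not None:
        # fold the 2D padding mask into the causal mask
        mask_length = attention_mask.shape[-1]
        padding = causal[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(dtype)
        causal[:, :, :, :mask_length] = causal[:, :, :, :mask_length].masked_fill(
            padding == 0, min_dtype
        )
    return causal
```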
- get_chunked_index(
- token_indices: torch.Tensor,
- tokens_per_chunk: int,
- remove_index: int,
Splits token index list into chunks based on token value ranges.
Given a list of token indices, returns a list of (start, end) index tuples representing slices of the list where the token values fall within successive ranges of tokens_per_chunk.
For example, if tokens_per_chunk is 1000, the function will create chunks such that:
- the first chunk contains token values < 1000,
- the second chunk contains values >= 1000 and < 2000, and so on.
- Parameters:
  - token_indices (torch.Tensor of shape (seq_len, )) – A monotonically increasing list of token index values.
  - tokens_per_chunk (int) – Number of tokens per chunk (used as the chunk size threshold).
  - remove_index (int)
- Returns:
  A list of tuples, each representing the start (inclusive) and end (exclusive) indices of a chunk in token_indices.
- Return type:
  list[tuple[int, int]]
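A hypothetical reimplementation of the documented bucketing. The role of remove_index is not described above, so treating it as an offset subtracted before bucketing is an assumption:

```python
import torch

def get_chunked_index(token_indices, tokens_per_chunk, remove_index=0):
    # assumption: remove_index shifts token values before bucketing
    values = (token_indices - remove_index).tolist()
    chunks, start = [], 0
    for i in range(1, len(values) + 1):
        # close the current chunk at the end, or when the value crosses a bucket boundary
        if i == len(values) or values[i] // tokens_per_chunk != values[start] // tokens_per_chunk:
            chunks.append((start, i))
            start = i
    return chunks

idx = torch.tensor([5, 999, 1000, 1999, 2100])
print(get_chunked_index(idx, 1000))  # [(0, 2), (2, 4), (4, 5)]
```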
- get_rope_index(
- attention_mask: Optional[torch.Tensor] = None,
Calculates the rope index in the LLM.
Explanation: each embedding sequence contains text embeddings.
- Parameters:
  - input_ids (torch.LongTensor of shape (batch_size, sequence_length)) – Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
  - attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
    - 1 for tokens that are not masked,
    - 0 for tokens that are masked.
  - audio_seqlens (torch.LongTensor of shape (num_audios), optional) – The feature length of each audio as seen by the LLM.
- Returns:
  - position_ids (torch.LongTensor of shape (3, batch_size, sequence_length))
  - mrope_position_deltas (torch.Tensor of shape (batch_size))
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRAudioAttention(config)#
Bases: torch.nn.Module
Multi-headed attention from 'Attention Is All You Need' paper
Initialization
- forward(
- hidden_states: torch.Tensor,
- cu_seqlens: Optional[torch.Tensor] = None,
- attention_mask: Optional[torch.Tensor] = None,
- **kwargs,
Input shape: Batch x Time x Channel
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRAudioEncoderLayer( )#
Bases: transformers.modeling_layers.GradientCheckpointingLayer
Initialization
- forward(
- hidden_states: torch.Tensor,
- cu_seqlens: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- **kwargs,
- Parameters:
  - hidden_states (torch.FloatTensor) – input to the layer of shape (batch, seq_len, embed_dim)
  - attention_mask (torch.FloatTensor) – attention mask of size (batch, 1, tgt_len, src_len) where padding elements are indicated by very large negative values.
  - layer_head_mask (torch.FloatTensor) – mask for attention heads in a given layer of size (encoder_attention_heads,).
  - output_attentions (bool, optional) – Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.SinusoidsPositionEmbedding(length, channels, max_timescale=10000)#
Bases: torch.nn.Module
Initialization
- forward(seqlen: int)#
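As the name suggests, this module appears to hold Whisper-style fixed sinusoidal embeddings; a sketch of that construction (illustrative, assumes an even number of channels):

```python
import torch

def sinusoids(length: int, channels: int, max_timescale: float = 10000) -> torch.Tensor:
    assert channels % 2 == 0, "channels must be even"
    # geometrically spaced inverse timescales over half the channels
    log_timescale_increment = torch.log(torch.tensor(float(max_timescale))) / (channels // 2 - 1)
    inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2))
    scaled_time = torch.arange(length)[:, None].float() * inv_timescales[None, :]
    # concatenate sin and cos halves -> (length, channels)
    return torch.cat([scaled_time.sin(), scaled_time.cos()], dim=1)
```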
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRAudioEncoder( )#
Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel
- config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRAudioEncoderConfig#
None
- main_input_name#
'input_features'
- _no_split_modules#
['Qwen3ASRAudioEncoderLayer']
- _supports_sdpa#
True
- _freeze_parameters()#
- get_input_embeddings() torch.nn.Module#
- set_input_embeddings(value: torch.nn.Module)#
- _prepare_attention_mask(
- inputs_tensor: torch.Tensor,
- cu_seqlens: torch.Tensor,
- forward(input_features, feature_lens=None, aftercnn_lens=None)#
feature_lens (torch.LongTensor of shape (batch_size,)) – mel length
aftercnn_lens (torch.LongTensor of shape (batch_size,)) – mel length after cnn
- padded_and_mask_function(
- tensor_list,
- tensor_len,
- padding_value=0,
- padding_side='right',
Pads a sequence of tensors to their maximum length on the indicated padding_side, then prepares a mask so that pad tokens are not attended to.
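A minimal sketch of this pad-and-mask pattern, assuming padding over the first dimension of each tensor (not the exact source):

```python
import torch

def pad_and_mask(tensor_list, tensor_len, padding_value=0, padding_side="right"):
    max_len = int(max(tensor_len))
    padded, masks = [], []
    for t, n in zip(tensor_list, tensor_len):
        # pad block matching the trailing dims of t
        pad = t.new_full((max_len - n, *t.shape[1:]), padding_value)
        ones, zeros = torch.ones(n), torch.zeros(max_len - n)
        if padding_side == "right":
            padded.append(torch.cat([t, pad], dim=0))
            masks.append(torch.cat([ones, zeros]))
        else:
            padded.append(torch.cat([pad, t], dim=0))
            masks.append(torch.cat([zeros, ones]))
    return torch.stack(padded), torch.stack(masks)
```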
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextRotaryEmbedding(
- config: bridge.models.qwen3_asr.hf_qwen3_asr.configuration_qwen3_asr.Qwen3ASRConfig,
- device=None,
Bases: torch.nn.Module
Initialization
- inv_freq: torch.Tensor#
None
- apply_interleaved_mrope(freqs, mrope_section)#
Apply interleaved MRoPE to 3D rotary embeddings. Reorganizes frequency layout from chunked [TTT…HHH…WWW] to interleaved [THTHWHTHW…TT], preserving frequency continuity.
- Parameters:
  - freqs – (3, bs, seq_len, head_dim // 2)
  - mrope_section – (3,)
- Returns:
  (bs, seq_len, head_dim // 2)
- Return type:
  torch.Tensor
- forward(x, position_ids)#
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextMLP(config, intermediate_size=None)#
Bases: torch.nn.Module
Initialization
- forward(x)#
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextRMSNorm(hidden_size, eps=1e-06)#
Bases: torch.nn.Module
Initialization
Qwen3ASRThinkerTextRMSNorm is equivalent to T5LayerNorm
- forward(hidden_states)#
- extra_repr()#
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextAttention(config, layer_idx)#
Bases: torch.nn.Module
Multi-headed attention from 'Attention Is All You Need' paper
Initialization
- forward(
- hidden_states: torch.Tensor,
- position_embeddings: tuple[torch.Tensor, torch.Tensor],
- attention_mask: Optional[torch.Tensor],
- past_key_values: Optional[transformers.cache_utils.Cache] = None,
- cache_position: Optional[torch.LongTensor] = None,
- **kwargs: transformers.processing_utils.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs],
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextModel( )#
Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel
Text model component of the Qwen3-ASR thinker.
Initialization
- _no_split_modules#
['Qwen3ASRThinkerTextDecoderLayer']
- config_class#
None
- _can_record_outputs#
None
- forward(
- input_ids: Optional[torch.LongTensor] = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[transformers.cache_utils.Cache] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- use_cache: Optional[bool] = None,
- cache_position: Optional[torch.LongTensor] = None,
- **kwargs: transformers.processing_utils.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs],
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerForConditionalGeneration(config)#
Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModelForConditionalGeneration, transformers.generation.GenerationMixin
Qwen3-ASR thinker model for conditional generation.
Initialization
- base_model_prefix#
'thinker'
- _tied_weights_keys#
['model.embed_tokens.weight', 'lm_head.weight']
- _no_split_modules#
['Qwen3ASRAudioEncoderLayer', 'Qwen3ASRThinkerTextDecoderLayer']
- _can_record_outputs#
None
- get_input_embeddings()#
- set_input_embeddings(value)#
- get_audio_features(
- input_features: torch.FloatTensor,
- feature_attention_mask: Optional[torch.LongTensor] = None,
- audio_feature_lengths: Optional[torch.LongTensor] = None,
Encodes audios into continuous embeddings that can be forwarded to the language model.
- Parameters:
  - input_features (torch.FloatTensor) – The tensors corresponding to the input audios.
  - feature_attention_mask (torch.LongTensor, optional) – Mask to avoid performing attention on padding feature indices. Mask values selected in [0, 1].
  - audio_feature_lengths (torch.LongTensor of shape (num_audios), optional) – The feature length of each audio as seen by the LLM.
- get_placeholder_mask(
- input_ids: torch.LongTensor,
- inputs_embeds: torch.FloatTensor,
Obtains the multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count equals the length of the multimodal features. If the lengths differ, an error is raised.
- forward(
- input_ids=None,
- input_features=None,
- attention_mask=None,
- feature_attention_mask=None,
- audio_feature_lengths=None,
- position_ids=None,
- past_key_values=None,
- inputs_embeds=None,
- rope_deltas=None,
- labels=None,
- use_cache=None,
- cache_position=None,
- **kwargs,
- Parameters:
  - feature_attention_mask (torch.Tensor of shape (batch_size, feature_sequence_length), optional) – Mask to avoid performing attention on padding feature indices. Mask values selected in [0, 1]:
    - 1 for tokens that are not masked,
    - 0 for tokens that are masked.
  - audio_feature_lengths (torch.LongTensor of shape (num_audios), optional) – The feature length of each audio as seen by the LLM.
  - rope_deltas (torch.LongTensor of shape (batch_size, ), optional) – The rope index difference between sequence length and multimodal rope.
  - labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for tokens with labels in [0, ..., config.vocab_size].
- prepare_inputs_for_generation(
- input_ids,
- past_key_values=None,
- attention_mask=None,
- inputs_embeds=None,
- cache_position=None,
- position_ids=None,
- use_cache=True,
- input_features=None,
- feature_attention_mask=None,
- **kwargs,
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRThinkerTextPreTrainedModel#
Bases: transformers.modeling_utils.PreTrainedModel
Base pretrained model for the Qwen3-ASR thinker text component.
- config#
None
- base_model_prefix#
'model'
- supports_gradient_checkpointing#
True
- _no_split_modules#
['Qwen3ASRThinkerTextDecoderLayer']
- _skip_keys_device_placement#
['past_key_values']
- _supports_flash_attn#
True
- _supports_sdpa#
True
- _supports_flex_attn#
True
- _can_compile_fullgraph#
False
- _supports_attention_backend#
True
- _can_record_outputs#
None
- config_class#
None
- class bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRForConditionalGeneration( )#
Bases: bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.Qwen3ASRPreTrainedModel, transformers.generation.GenerationMixin
Qwen3-ASR model for conditional generation.
Initialization
- config_class#
None
- get_support_languages()#
- generate(
- input_ids: Optional[torch.Tensor] = None,
- max_new_tokens: int = 4096,
- eos_token_id: int | list[int] = [151645, 151643],
- **kwargs,
- bridge.models.qwen3_asr.hf_qwen3_asr.modeling_qwen3_asr.__all__#
['Qwen3ASRForConditionalGeneration', 'Qwen3ASRThinkerTextModel', 'Qwen3ASRThinkerForConditionalGener…