bridge.models.qwen_vl.modelling_qwen3_vl.rope#
Module Contents#
Classes#
Multimodal rotary embedding for the language model. Only supports Qwen3-VL.
Functions#
Different from the original implementation, Qwen3-VL uses timestamps rather than absolute time position ids.
A baseline implementation of applying RoPE for thd format.
Reroutes to the appropriate apply_rotary_pos_emb function depending on bshd (conventional) / thd (packed sequence) format.
API#
- class bridge.models.qwen_vl.modelling_qwen3_vl.rope.Qwen3VLMultimodalRotaryEmbedding(
- kv_channels: int,
- rotary_percent: float = 1.0,
- rotary_interleaved: bool = False,
- seq_len_interpolation_factor: Optional[float] = None,
- rotary_base: int = 10000,
- cp_group: torch.distributed.ProcessGroup = None,
- )
Bases: torch.nn.Module
Multimodal rotary embedding for the language model. Only supports Qwen3-VL.
- Parameters:
kv_channels (int) – Projection weights dimension in multi-head attention. Obtained from transformer config
rotary_percent (float) – Percent of rotary dimension to use for rotary position embeddings.
rotary_interleaved (bool, optional) – If True, interleaved rotary position embeddings. Defaults to False.
seq_len_interpolation_factor (float, optional) – Scale factor for linearly interpolating RoPE to longer sequences. The value must be a float larger than 1.0. Defaults to None.
rotary_base (int, optional) – Base period for rotary position embeddings. Defaults to 10000.
Initialization
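The constructor arguments follow the standard RoPE recipe. As a hedged illustration (an assumption about the usual formulation, not this class's actual code), the sketch below shows how kv_channels, rotary_percent, and rotary_base typically determine the inverse-frequency table:

```python
import torch

# Hedged sketch: the standard RoPE inverse-frequency table as typically
# derived from kv_channels, rotary_percent, and rotary_base. The actual
# Qwen3VLMultimodalRotaryEmbedding may differ in details.
kv_channels, rotary_percent, rotary_base = 128, 1.0, 10000
dim = int(kv_channels * rotary_percent)  # rotary dimension actually used
inv_freq = 1.0 / (
    rotary_base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)
)
print(inv_freq.shape)  # one frequency per channel pair
```

A rotary_percent below 1.0 simply shrinks `dim`, leaving the remaining channels unrotated; seq_len_interpolation_factor would additionally rescale positions before the angles are formed.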
- apply_interleaved_mrope(freqs, mrope_section)#
Apply interleaved MRoPE to 3D rotary embeddings. Reorganizes frequency layout from chunked [TTT…HHH…WWW] to interleaved [THTHWHTHW…TT], preserving frequency continuity.
- Parameters:
freqs – (3, bs, seq_len, head_dim // 2)
mrope_section – (3,)
- Returns:
(bs, seq_len, head_dim // 2)
- Return type:
x_t
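The chunked-to-interleaved reorganization described above can be sketched as follows. This is a hedged illustration of the technique, not the module's exact code: starting from the temporal plane, the height and width planes overwrite every third channel at offsets 1 and 2, up to each section's budget.

```python
import torch

def apply_interleaved_mrope_sketch(
    freqs: torch.Tensor, mrope_section: list[int]
) -> torch.Tensor:
    """Hedged sketch: interleave temporal/height/width frequencies.

    Assumes freqs has shape (3, bs, seq_len, head_dim // 2) and
    sum(mrope_section) == head_dim // 2.
    """
    # Start from the temporal plane, then overwrite interleaved
    # positions with height (dim 1) and width (dim 2) frequencies.
    out = freqs[0].clone()  # (bs, seq_len, head_dim // 2)
    for dim, offset in ((1, 1), (2, 2)):
        length = mrope_section[dim] * 3
        idx = torch.arange(offset, length, 3)
        out[..., idx] = freqs[dim][..., idx]
    return out

freqs = torch.randn(3, 2, 5, 12)  # (3, bs, seq_len, head_dim // 2)
out = apply_interleaved_mrope_sketch(freqs, [4, 4, 4])
print(out.shape)  # (2, 5, 12)
```

With equal sections the layout is a strict T, H, W cycle; when the temporal section is larger than the others, the trailing channels stay temporal, matching the `[THTHWHTHW…TT]` pattern in the description.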
- forward(
- position_ids: torch.Tensor,
- mrope_section: List[int] | None,
- packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
- **kwargs,
- )
Forward pass of multimodal RoPE embedding.
- Parameters:
position_ids (torch.Tensor) – A position id tensor with shape [3, batch_size, seq_len].
mrope_section (list[int]) – Channel-dimension split sizes for the temporal, height, and width components in the RoPE calculation.
packed_seq_params (PackedSeqParams, optional) – Packed sequence params. Defaults to None.
- Returns:
Embeddings after applying RoPE.
- Return type:
Tensor
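The `[3, batch_size, seq_len]` layout of position_ids stacks the temporal, height, and width planes on the first dimension. As a hedged illustration (not library code): for text-only tokens the three planes are identical, so MRoPE degenerates to ordinary 1D RoPE, while vision tokens would receive diverging per-plane positions.

```python
import torch

# Hedged illustration: build MRoPE position ids for a text-only batch.
# All three planes (temporal, height, width) share the same 1D positions.
bs, seq_len = 2, 8
text_positions = torch.arange(seq_len).expand(bs, seq_len)  # [bs, seq_len]
position_ids = text_positions.unsqueeze(0).expand(3, bs, seq_len).contiguous()
print(position_ids.shape)  # (3, 2, 8)
```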
- bridge.models.qwen_vl.modelling_qwen3_vl.rope.get_rope_index(
- spatial_merge_size: int,
- image_token_id: int,
- video_token_id: int,
- vision_start_token_id: int,
- input_ids: Optional[torch.LongTensor] = None,
- image_grid_thw: Optional[torch.LongTensor] = None,
- video_grid_thw: Optional[torch.LongTensor] = None,
- attention_mask: Optional[torch.Tensor] = None,
- packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
- )
Different from the original implementation, Qwen3-VL uses timestamps rather than absolute time position ids.
- bridge.models.qwen_vl.modelling_qwen3_vl.rope.apply_rotary_pos_emb_thd_absolute(
- t: torch.Tensor,
- cu_seqlens: torch.Tensor,
- freqs: torch.Tensor,
- rotary_interleaved: bool = False,
- )
A baseline implementation of applying RoPE for thd format.
- Parameters:
t (Tensor) – Input tensor T is of shape [t, h, d]
cu_seqlens (Tensor) – Cumulative sum of sequence lengths in a batch for t, with shape [b + 1] and dtype torch.int32. Currently unused but kept for API consistency.
freqs (Tensor) – Rotary positional embedding tensor freq is of shape [max_s, 1, 1, d]
- Returns:
Shape [t, h, d]. The input tensor after applying RoPE.
- Return type:
Tensor
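Since cu_seqlens is unused here, the "absolute" baseline amounts to a plain per-token rotation over the packed [t, h, d] tensor. The sketch below is a hedged simplification, assuming freqs already encodes an absolute angle per packed token and collapsing it to [t, 1, d] for broadcasting over heads; the real function's freqs shape and rotation variant may differ.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Non-interleaved rotation: negate the second half, swap halves.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope_thd_sketch(t: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Hedged sketch: t is [t, h, d] packed tokens (no batch dim);
    freqs is simplified to [t, 1, d] absolute angles per token."""
    return t * freqs.cos() + rotate_half(t) * freqs.sin()

tokens = torch.randn(10, 4, 8)  # [t, h, d]
angles = torch.randn(10, 1, 8)  # per-token absolute rotary angles
out = apply_rope_thd_sketch(tokens, angles)
print(out.shape)  # (10, 4, 8)
```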
- bridge.models.qwen_vl.modelling_qwen3_vl.rope.apply_rotary_pos_emb_absolute(
- t: torch.Tensor,
- freqs: torch.Tensor,
- config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
- cu_seqlens: Optional[torch.Tensor] = None,
- )
Reroutes to the appropriate apply_rotary_pos_emb function depending on bshd (conventional) / thd (packed sequence) format.
In Qwen3-VL, the shape of freqs is (seq_length, bs, 1, 2 * dim) instead of [max_seqlen, 1, 1, 2 * dim].
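The routing decision can be sketched as follows. This is a hedged illustration of the dispatch logic only: the presence of cu_seqlens marks a packed (thd) layout, its absence the conventional bshd layout. The real function forwards to the matching apply_rotary_pos_emb implementation rather than returning a label.

```python
from typing import Optional

import torch

def route_rope_sketch(
    t: torch.Tensor,
    freqs: torch.Tensor,
    cu_seqlens: Optional[torch.Tensor] = None,
) -> str:
    # Hedged sketch: cu_seqlens present -> packed thd path,
    # absent -> conventional bshd path.
    return "bshd" if cu_seqlens is None else "thd"

bshd = route_rope_sketch(torch.empty(2, 16, 4, 8), torch.empty(16, 2, 1, 16))
thd = route_rope_sketch(
    torch.empty(16, 4, 8),
    torch.empty(16, 1, 1, 16),
    cu_seqlens=torch.tensor([0, 6, 16], dtype=torch.int32),
)
print(bshd, thd)  # bshd thd
```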