bridge.models.qwen_vl.modelling_qwen3_vl.rope#

Module Contents#

Classes#

Qwen3VLMultimodalRotaryEmbedding

Multimodal rotary embedding for the language model. Only supported for Qwen3-VL.

Functions#

get_rope_index

Different from the original implementation, Qwen3-VL uses timestamps rather than absolute time position ids.

apply_rotary_pos_emb_thd_absolute

A baseline implementation of applying RoPE for the thd format.

apply_rotary_pos_emb_absolute

Routes to the appropriate apply_rotary_pos_emb function depending on the bshd (conventional) / thd (packed-seq) format.

API#

class bridge.models.qwen_vl.modelling_qwen3_vl.rope.Qwen3VLMultimodalRotaryEmbedding(
kv_channels: int,
rotary_percent: float = 1.0,
rotary_interleaved: bool = False,
seq_len_interpolation_factor: Optional[float] = None,
rotary_base: int = 10000,
cp_group: torch.distributed.ProcessGroup = None,
)#

Bases: torch.nn.Module

Multimodal rotary embedding for the language model. Only supported for Qwen3-VL.

Parameters:
  • kv_channels (int) – Projection weights dimension in multi-head attention. Obtained from transformer config

  • rotary_percent (float) – Percent of rotary dimension to use for rotary position embeddings.

  • rotary_interleaved (bool, optional) – If True, interleaved rotary position embeddings. Defaults to False.

  • seq_len_interpolation_factor (float, optional) – Scale for linearly interpolating RoPE for longer sequences. The value must be a float larger than 1.0. Defaults to None.

  • rotary_base (int, optional) – Base period for rotary position embeddings. Defaults to 10000.

Initialization

apply_interleaved_mrope(freqs, mrope_section)#

Apply interleaved MRoPE to 3D rotary embeddings. Reorganizes the frequency layout from chunked [TTT…HHH…WWW] to interleaved [THWTHW…TT], preserving frequency continuity.

Parameters:
  • freqs – Rotary frequency tensor of shape (3, bs, seq_len, head_dim // 2).

  • mrope_section – Per-axis channel counts (temporal, height, width), shape (3,).

Returns:

Merged frequency tensor of shape (bs, seq_len, head_dim // 2).

Return type:

torch.Tensor
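A minimal sketch of the interleaving described above, as a hypothetical standalone helper (the real method lives on the class and may differ in detail): the temporal plane is used as the base, and the height/width planes overwrite every third channel starting at offsets 1 and 2.

```python
import torch


def apply_interleaved_mrope_sketch(freqs: torch.Tensor, mrope_section: list) -> torch.Tensor:
    """Merge T/H/W frequency planes into one interleaved layout.

    freqs: (3, bs, seq_len, head_dim // 2)
    mrope_section: three ints summing to head_dim // 2
    """
    merged = freqs[0].clone()  # start from the temporal plane
    # H fills channels 1, 4, 7, ...; W fills channels 2, 5, 8, ...,
    # each up to 3 * its mrope_section size.
    for axis, offset in ((1, 1), (2, 2)):
        idx = slice(offset, mrope_section[axis] * 3, 3)
        merged[..., idx] = freqs[axis][..., idx]
    return merged
```

With mrope_section = [2, 2, 2] and head_dim // 2 = 6, the resulting channel layout is T H W T H W, matching the interleaved pattern while the remaining high-frequency channels stay temporal.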

forward(
position_ids: torch.Tensor,
mrope_section: List[int] | None,
packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
**kwargs,
) torch.Tensor#

Forward pass of multimodal RoPE embedding.

Parameters:
  • position_ids (torch.Tensor) – A position id tensor of shape [3, batch_size, seq_len].

  • mrope_section (list[int]) – Number of channels allocated to the temporal, height, and width dimensions in the RoPE calculation.

  • packed_seq_params (PackedSeqParams, optional) – Packed sequence params. Defaults to None.

Returns:

Embeddings after applying RoPE.

Return type:

Tensor
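A shape sketch for the position_ids argument: one row per MRoPE axis (temporal, height, width). For text-only tokens the three axes typically share the same 1D positions (an assumption here; vision tokens get distinct per-axis indices from get_rope_index).

```python
import torch

# Build a [3, batch_size, seq_len] position_ids tensor for pure text,
# where the temporal/height/width rows all carry the same 1D positions.
batch_size, seq_len = 2, 8
text_positions = torch.arange(seq_len).expand(batch_size, seq_len)
position_ids = text_positions.unsqueeze(0).expand(3, batch_size, seq_len)
print(position_ids.shape)  # torch.Size([3, 2, 8])
```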

bridge.models.qwen_vl.modelling_qwen3_vl.rope.get_rope_index(
spatial_merge_size: int,
image_token_id: int,
video_token_id: int,
vision_start_token_id: int,
input_ids: Optional[torch.LongTensor] = None,
image_grid_thw: Optional[torch.LongTensor] = None,
video_grid_thw: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
) tuple[torch.Tensor, torch.Tensor]#

Different from the original implementation, Qwen3-VL uses timestamps rather than absolute time position ids.
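The 3D position ids for a single t × h × w vision patch grid can be sketched as follows (illustrative helper only; the real get_rope_index also interleaves text spans, applies spatial merging, and derives temporal ids from timestamps as noted above):

```python
import torch


def vision_position_ids_sketch(t: int, h: int, w: int) -> torch.Tensor:
    """Return (3, t*h*w) temporal/height/width indices for a patch grid."""
    # Temporal index is constant within a frame; height varies by row;
    # width varies by column. Flatten in (t, h, w) order.
    t_idx = torch.arange(t).view(t, 1, 1).expand(t, h, w).reshape(-1)
    h_idx = torch.arange(h).view(1, h, 1).expand(t, h, w).reshape(-1)
    w_idx = torch.arange(w).view(1, 1, w).expand(t, h, w).reshape(-1)
    return torch.stack([t_idx, h_idx, w_idx])
```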

bridge.models.qwen_vl.modelling_qwen3_vl.rope.apply_rotary_pos_emb_thd_absolute(
t: torch.Tensor,
cu_seqlens: torch.Tensor,
freqs: torch.Tensor,
rotary_interleaved: bool = False,
) torch.Tensor#

A baseline implementation of applying RoPE for the thd format.

Parameters:
  • t (Tensor) – Input tensor of shape [t, h, d].

  • cu_seqlens (Tensor) – Cumulative sum of sequence lengths in a batch for t, with shape [b + 1] and dtype torch.int32. Currently unused but kept for API consistency.

  • freqs (Tensor) – Rotary positional embedding tensor of shape [max_s, 1, 1, d].

Returns:

Shape [t, h, d]. The input tensor after applying RoPE.

Return type:

Tensor
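The baseline application itself follows the standard rotate-half RoPE formulation; a self-contained sketch with hypothetical names, covering only the non-interleaved case:

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Swap the two halves of the last dimension, negating the second."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope_sketch(t: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate t by the angles in freqs.

    t: [t, h, d]; freqs: angles broadcastable to t, e.g. [t, 1, d].
    """
    return t * freqs.cos() + rotate_half(t) * freqs.sin()
```

Each pair of channels (i, i + d // 2) is rotated by the corresponding angle, so a unit vector rotated by π/2 lands on the other channel of its pair.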

bridge.models.qwen_vl.modelling_qwen3_vl.rope.apply_rotary_pos_emb_absolute(
t: torch.Tensor,
freqs: torch.Tensor,
config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
cu_seqlens: Optional[torch.Tensor] = None,
)#

Routes to the appropriate apply_rotary_pos_emb function depending on the bshd (conventional) / thd (packed-seq) format.

In Qwen3-VL, the shape of freqs is (seq_length, bs, 1, 2 * dim) instead of (max_seqlen, 1, 1, 2 * dim).