bridge.models.qwen_vl.modelling_qwen3_vl.rope#

Module Contents#

Classes#

Qwen3VLMultimodalRotaryEmbedding

Multimodal rotary embedding for the language model. Only supported for Qwen3-VL.

Functions#

get_rope_index

Different from the original implementation, Qwen3-VL uses timestamps rather than absolute time position ids.

apply_rotary_pos_emb_thd_absolute

A baseline implementation of applying RoPE for the thd format.

apply_rotary_pos_emb_absolute

Routes to the appropriate apply_rotary_pos_emb function depending on the bshd (conventional) / thd (packed-seq) format.

API#

class bridge.models.qwen_vl.modelling_qwen3_vl.rope.Qwen3VLMultimodalRotaryEmbedding(
kv_channels: int,
rotary_percent: float = 1.0,
rotary_interleaved: bool = False,
seq_len_interpolation_factor: Optional[float] = None,
rotary_base: int = 10000,
cp_group: torch.distributed.ProcessGroup = None,
)#

Bases: torch.nn.Module

Multimodal rotary embedding for the language model. Only supported for Qwen3-VL.

Parameters:
  • kv_channels (int) – Projection weights dimension in multi-head attention. Obtained from transformer config

  • rotary_percent (float) – Percent of rotary dimension to use for rotary position embeddings.

  • rotary_interleaved (bool, optional) – If True, interleaved rotary position embeddings. Defaults to False.

  • seq_len_interpolation_factor (float, optional) – Scale for linearly interpolating RoPE for longer sequences. The value must be a float larger than 1.0. Defaults to None.

  • rotary_base (int, optional) – Base period for rotary position embeddings. Defaults to 10000.

Initialization

apply_interleaved_mrope(freqs, mrope_section)#

Apply interleaved MRoPE to 3D rotary embeddings. Reorganizes the frequency layout from chunked [TTT…HHH…WWW] to interleaved [THWTHW…TT], preserving frequency continuity.

Parameters:
  • freqs – Rotary frequency tensor of shape (3, bs, seq_len, head_dim // 2).

  • mrope_section – Per-axis channel counts (temporal, height, width), shape (3,).

Returns:

Merged frequency tensor of shape (bs, seq_len, head_dim // 2).

Return type:

torch.Tensor
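A minimal sketch of the interleaving described above, as a hypothetical standalone helper (the real method lives on the class and may differ in detail): the temporal plane is used as the base, and the height/width planes overwrite every third channel starting at offsets 1 and 2.

```python
import torch


def apply_interleaved_mrope_sketch(freqs: torch.Tensor, mrope_section: list) -> torch.Tensor:
    """Merge T/H/W frequency planes into one interleaved layout.

    freqs: (3, bs, seq_len, head_dim // 2)
    mrope_section: three ints summing to head_dim // 2
    """
    merged = freqs[0].clone()  # start from the temporal plane
    # H fills channels 1, 4, 7, ...; W fills channels 2, 5, 8, ...,
    # each up to 3 * its mrope_section size.
    for axis, offset in ((1, 1), (2, 2)):
        idx = slice(offset, mrope_section[axis] * 3, 3)
        merged[..., idx] = freqs[axis][..., idx]
    return merged
```

With mrope_section = [2, 2, 2] and head_dim // 2 = 6, the resulting channel layout is T H W T H W, matching the interleaved pattern while the remaining high-frequency channels stay temporal.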

forward(
position_ids: torch.Tensor,
mrope_section: List[int] | None,
packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
**kwargs,
) torch.Tensor#

Forward pass of multimodal RoPE embedding.

Parameters:
  • position_ids (torch.Tensor) – A position id tensor of shape [3, batch_size, seq_len].

  • mrope_section (list[int]) – Number of channels allocated to the temporal, height, and width dimensions in the RoPE calculation.

  • packed_seq_params (PackedSeqParams, optional) – Packed sequence params. Defaults to None.

Returns:

Embeddings after applying RoPE.

Return type:

Tensor
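A shape sketch for the position_ids argument: one row per MRoPE axis (temporal, height, width). For text-only tokens the three axes typically share the same 1D positions (an assumption here; vision tokens get distinct per-axis indices from get_rope_index).

```python
import torch

# Build a [3, batch_size, seq_len] position_ids tensor for pure text,
# where the temporal/height/width rows all carry the same 1D positions.
batch_size, seq_len = 2, 8
text_positions = torch.arange(seq_len).expand(batch_size, seq_len)
position_ids = text_positions.unsqueeze(0).expand(3, batch_size, seq_len)
print(position_ids.shape)  # torch.Size([3, 2, 8])
```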

bridge.models.qwen_vl.modelling_qwen3_vl.rope.get_rope_index(
spatial_merge_size: int,
image_token_id: int,
video_token_id: int,
vision_start_token_id: int,
input_ids: Optional[torch.LongTensor] = None,
image_grid_thw: Optional[torch.LongTensor] = None,
video_grid_thw: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
) tuple[torch.Tensor, torch.Tensor]#

Different from the original implementation, Qwen3-VL uses timestamps rather than absolute time position ids.
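The 3D position ids for a single t × h × w vision patch grid can be sketched as follows (illustrative helper only; the real get_rope_index also interleaves text spans, applies spatial merging, and derives temporal ids from timestamps as noted above):

```python
import torch


def vision_position_ids_sketch(t: int, h: int, w: int) -> torch.Tensor:
    """Return (3, t*h*w) temporal/height/width indices for a patch grid."""
    # Temporal index is constant within a frame; height varies by row;
    # width varies by column. Flatten in (t, h, w) order.
    t_idx = torch.arange(t).view(t, 1, 1).expand(t, h, w).reshape(-1)
    h_idx = torch.arange(h).view(1, h, 1).expand(t, h, w).reshape(-1)
    w_idx = torch.arange(w).view(1, 1, w).expand(t, h, w).reshape(-1)
    return torch.stack([t_idx, h_idx, w_idx])
```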

bridge.models.qwen_vl.modelling_qwen3_vl.rope.apply_rotary_pos_emb_thd_absolute(
t: torch.Tensor,
cu_seqlens: torch.Tensor,
freqs: torch.Tensor,
rotary_interleaved: bool = False,
) torch.Tensor#

A baseline implementation of applying RoPE for the thd format.

Parameters:
  • t (Tensor) – Input tensor of shape [t, h, d].

  • cu_seqlens (Tensor) – Cumulative sum of sequence lengths in a batch for t, with shape [b + 1] and dtype torch.int32. Currently unused but kept for API consistency.

  • freqs (Tensor) – Rotary positional embedding tensor of shape [max_s, 1, 1, d].

Returns:

Shape [t, h, d]. The input tensor after applying RoPE.

Return type:

Tensor
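The baseline application itself follows the standard rotate-half RoPE formulation; a self-contained sketch with hypothetical names, covering only the non-interleaved case:

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Swap the two halves of the last dimension, negating the second."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope_sketch(t: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate t by the angles in freqs.

    t: [t, h, d]; freqs: angles broadcastable to t, e.g. [t, 1, d].
    """
    return t * freqs.cos() + rotate_half(t) * freqs.sin()
```

Each pair of channels (i, i + d // 2) is rotated by the corresponding angle, so a unit vector rotated by π/2 lands on the other channel of its pair.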

bridge.models.qwen_vl.modelling_qwen3_vl.rope.apply_rotary_pos_emb_absolute(
t: torch.Tensor,
freqs: torch.Tensor,
config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
cu_seqlens: Optional[torch.Tensor] = None,
)#

Routes to the appropriate apply_rotary_pos_emb function depending on the bshd (conventional) / thd (packed-seq) format.

In Qwen3-VL, the shape of freqs is (seq_length, bs, 1, 2 * dim) instead of (max_seqlen, 1, 1, 2 * dim).