bridge.models.qwen_omni.modeling_qwen3_omni.rope

Qwen3-Omni multimodal RoPE helpers ported from the HF thinker implementation.

Module Contents

Functions

_get_feat_extract_output_lengths

Compute Qwen3-Omni thinker audio token lengths from feature lengths.

get_llm_pos_ids_for_vision

Build 3D multimodal RoPE ids for image or video features.

_count_run

_next_position_start

get_rope_index

Build multimodal RoPE ids for text, image, video, and audio token layouts.

API

bridge.models.qwen_omni.modeling_qwen3_omni.rope._get_feat_extract_output_lengths(
input_lengths: torch.Tensor,
) -> torch.Tensor

Compute Qwen3-Omni thinker audio token lengths from feature lengths.
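As a rough illustration of what such a length computation can look like, the sketch below works on plain Python integers rather than tensors. The 100-frame chunk size, the 13 tokens per full chunk, and the three ceil-halving steps on the remainder are assumptions based on the upstream HF thinker code and should be checked against the actual source. `audio_token_length_sketch` is a hypothetical name, not part of this module.

```python
def audio_token_length_sketch(feature_length: int) -> int:
    """Estimate the number of audio tokens for one feature sequence.

    Assumption: features are processed in chunks of 100 frames, each full
    chunk yields 13 tokens, and the leftover frames are downsampled by
    three successive ceil-halvings.
    """
    full_chunks, remainder = divmod(feature_length, 100)

    # First ceil-halving of the leftover frames.
    feat = (remainder - 1) // 2 + 1
    # Second and third ceil-halvings, written as in the assumed formula.
    tail = ((feat - 1) // 2 + 1 - 1) // 2 + 1

    return tail + full_chunks * 13


if __name__ == "__main__":
    for n in (8, 100, 250):
        print(n, audio_token_length_sketch(n))
```

The same arithmetic applies elementwise to a `torch.Tensor` of lengths, since `%` and `//` broadcast over integer tensors.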

bridge.models.qwen_omni.modeling_qwen3_omni.rope.get_llm_pos_ids_for_vision(
start_idx: int,
vision_idx: int,
spatial_merge_size: int,
t_index: torch.Tensor,
grid_hs: torch.Tensor,
grid_ws: torch.Tensor,
) -> torch.Tensor

Build 3D multimodal RoPE ids for image or video features.
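The idea behind the 3D ids can be sketched without tensors: each merged patch gets a (temporal, height, width) triple, and all three axes are shifted by the starting position. This is a minimal sketch under stated assumptions, not the module's code. `vision_pos_ids_sketch` is a hypothetical helper that takes a single item's grid height and width directly, whereas the documented function selects them from `grid_hs` and `grid_ws` via `vision_idx`.

```python
def vision_pos_ids_sketch(
    start_idx: int,
    t_index: list[int],
    grid_h: int,
    grid_w: int,
    spatial_merge_size: int,
) -> list[list[int]]:
    """Return [t_ids, h_ids, w_ids] for one image or video item."""
    # Assumption: spatial merging shrinks the patch grid on both axes.
    llm_h = grid_h // spatial_merge_size
    llm_w = grid_w // spatial_merge_size

    t_ids, h_ids, w_ids = [], [], []
    for t in t_index:                # one entry per temporal grid step
        for h in range(llm_h):       # rows of merged patches
            for w in range(llm_w):   # columns of merged patches
                t_ids.append(start_idx + t)
                h_ids.append(start_idx + h)
                w_ids.append(start_idx + w)
    return [t_ids, h_ids, w_ids]


if __name__ == "__main__":
    # Two frames of a 4x4 patch grid merged 2x2, starting at position 5.
    for axis in vision_pos_ids_sketch(5, [0, 1], 4, 4, 2):
        print(axis)
```

Each axis has `len(t_index) * llm_h * llm_w` entries, matching one id triple per merged patch.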

bridge.models.qwen_omni.modeling_qwen3_omni.rope._count_run(tokens: list[int], start: int, token_id: int) -> int

bridge.models.qwen_omni.modeling_qwen3_omni.rope._next_position_start(
llm_pos_ids_list: list[torch.Tensor],
) -> torch.Tensor | int

bridge.models.qwen_omni.modeling_qwen3_omni.rope.get_rope_index(
spatial_merge_size: int,
image_token_id: int,
video_token_id: int,
audio_token_id: int,
vision_start_token_id: int,
audio_start_token_id: int,
position_id_per_seconds: int,
input_ids: torch.LongTensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
video_grid_thw: torch.LongTensor | None = None,
attention_mask: torch.Tensor | None = None,
use_audio_in_video: bool = False,
audio_seqlens: torch.LongTensor | None = None,
second_per_grids: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]

Build multimodal RoPE ids for text, image, video, and audio token layouts.

This mirrors the HF Qwen3-Omni thinker implementation so local Megatron smoke tests exercise the same placeholder ordering and audio/image position handling.
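To make the layout concrete, here is a simplified sketch of how multimodal RoPE ids are commonly assembled for a mixed text and image sequence. It is an illustration only: `mrope_ids_sketch` and its `segments` argument are hypothetical, and it assumes that text tokens share one id across all three axes, that image patches receive (t, h, w) ids offset by the current start, and that the next segment resumes at the largest id so far plus one. Audio placeholders, the start tokens, temporal scaling via `position_id_per_seconds` and `second_per_grids`, and attention masking are left out.

```python
def mrope_ids_sketch(segments, spatial_merge_size=2):
    """Build [t_ids, h_ids, w_ids] for a list of segments.

    Each segment is ("text", n_tokens) or ("image", (t, h, w)), where
    (t, h, w) is the patch grid before spatial merging.
    """
    t_ids, h_ids, w_ids = [], [], []
    next_start = 0

    for kind, spec in segments:
        if kind == "text":
            # Text: identical, monotonically increasing ids on every axis.
            ids = [next_start + i for i in range(spec)]
            t_ids += ids
            h_ids += ids
            w_ids += ids
        elif kind == "image":
            t, h, w = spec
            llm_h = h // spatial_merge_size
            llm_w = w // spatial_merge_size
            for ti in range(t):
                for hi in range(llm_h):
                    for wi in range(llm_w):
                        t_ids.append(next_start + ti)
                        h_ids.append(next_start + hi)
                        w_ids.append(next_start + wi)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")

        # Assumption: the next segment starts one past the largest id so far.
        if t_ids:
            next_start = max(max(t_ids), max(h_ids), max(w_ids)) + 1

    return [t_ids, h_ids, w_ids]


if __name__ == "__main__":
    layout = [("text", 2), ("image", (1, 4, 4)), ("text", 2)]
    for axis in mrope_ids_sketch(layout):
        print(axis)
```

Note that the text after the image resumes at 4 rather than 6: the image occupies four sequence slots but only advances the position counter by the extent of its largest axis.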