bridge.models.qwen_omni.modeling_qwen25_omni.rope#

Module Contents#

Functions#

_get_feat_extract_output_lengths

Computes the output length of the convolutional layers and the audio encoder for Qwen2.5-Omni.

get_llm_pos_ids_for_vision

Get LLM position IDs for vision tokens (3D: temporal, height, width).

get_chunked_index

Splits a token index list into chunks based on token value ranges.

get_rope_index

Calculate the 3D RoPE index from the temporal, height, and width dimensions of images and videos in the LLM.

API#

bridge.models.qwen_omni.modeling_qwen25_omni.rope._get_feat_extract_output_lengths(input_lengths)#

Computes the output length of the convolutional layers and the audio encoder for Qwen2.5-Omni.

Formula: feat = (input_lengths - 1) // 2 + 1, output = (feat - 2) // 2 + 1
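The formula can be sketched as a standalone function; the name and docstring below are illustrative, not the module's actual implementation:

```python
import torch

def feat_extract_output_lengths(input_lengths: torch.LongTensor) -> torch.LongTensor:
    """Sketch of the documented formula: two stride-2 downsampling stages
    (the convolutional feature extractor, then the audio encoder)."""
    feat_lengths = (input_lengths - 1) // 2 + 1    # first stride-2 stage
    output_lengths = (feat_lengths - 2) // 2 + 1   # second stride-2 stage
    return output_lengths

lengths = torch.tensor([100, 3000])
print(feat_extract_output_lengths(lengths))  # tensor([ 25, 750])
```

Each stage roughly halves the sequence length, so 3000 audio frames reduce to 750 encoder outputs.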

bridge.models.qwen_omni.modeling_qwen25_omni.rope.get_llm_pos_ids_for_vision(
start_idx: int,
vision_idx: int,
spatial_merge_size: int,
t_index: list[torch.Tensor],
grid_hs: list[torch.Tensor],
grid_ws: list[torch.Tensor],
)#

Get LLM position IDs for vision tokens (3D: temporal, height, width).
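A minimal sketch of the position-ID layout, following the Hugging Face Qwen2.5-Omni reference behavior (the function name here is illustrative): the temporal index is constant within a frame, while height and width indices enumerate the spatially merged grid, all offset by `start_idx`.

```python
import torch

def llm_pos_ids_for_vision_sketch(start_idx, vision_idx, spatial_merge_size,
                                  t_index, grid_hs, grid_ws):
    # Grid size after spatial merging: each merged token covers a
    # spatial_merge_size x spatial_merge_size patch block.
    llm_grid_h = int(grid_hs[vision_idx]) // spatial_merge_size
    llm_grid_w = int(grid_ws[vision_idx]) // spatial_merge_size
    n_t = len(t_index)
    # Height index varies over rows, repeated across frames and columns.
    h_index = torch.arange(llm_grid_h).view(1, -1, 1).expand(n_t, -1, llm_grid_w).flatten()
    # Width index varies over columns, repeated across frames and rows.
    w_index = torch.arange(llm_grid_w).view(1, 1, -1).expand(n_t, llm_grid_h, -1).flatten()
    # Temporal index is constant within each frame's h*w tokens.
    t_idx = torch.tensor(t_index).view(-1, 1).expand(-1, llm_grid_h * llm_grid_w).flatten().long()
    # Stack into (3, num_tokens) and offset so positions continue from start_idx.
    return torch.stack([t_idx, h_index, w_index]) + start_idx

pos = llm_pos_ids_for_vision_sketch(
    start_idx=10, vision_idx=0, spatial_merge_size=2,
    t_index=[0, 1], grid_hs=torch.tensor([4]), grid_ws=torch.tensor([4]))
print(pos.shape)  # torch.Size([3, 8]): 2 frames x (4//2) x (4//2) merged tokens
```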

bridge.models.qwen_omni.modeling_qwen25_omni.rope.get_chunked_index(
token_indices: torch.Tensor,
tokens_per_chunk: int,
remove_index: int,
) list[tuple[int, int]]#

Splits a token index list into chunks based on token value ranges.

Given a list of token indices, returns a list of (start, end) index tuples representing slices of the list where the token values fall within successive ranges of tokens_per_chunk.
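A sketch of that behavior on a plain list (the real function takes a `torch.Tensor`; the name below is illustrative): a new chunk is cut each time a value, shifted by `remove_index`, crosses the next `tokens_per_chunk` boundary.

```python
def chunked_index_sketch(token_indices, tokens_per_chunk, remove_index):
    # Walk the (sorted) token indices left to right; start a new chunk
    # whenever a shifted value enters the next tokens_per_chunk range.
    chunks, start, current_chunk = [], 0, 1
    for i, token in enumerate(token_indices):
        if token - remove_index >= current_chunk * tokens_per_chunk:
            chunks.append((start, i))
            start = i
            current_chunk += 1
    chunks.append((start, len(token_indices)))
    return chunks

print(chunked_index_sketch([0, 1, 5, 6, 10, 11], tokens_per_chunk=4, remove_index=0))
# -> [(0, 2), (2, 4), (4, 6)]
```

Values 0 and 1 fall in range [0, 4), 5 and 6 in [4, 8), and 10 and 11 in [8, 12), giving three (start, end) slices.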

bridge.models.qwen_omni.modeling_qwen25_omni.rope.get_rope_index(
spatial_merge_size: int,
image_token_id: int,
video_token_id: int,
audio_token_id: int,
vision_start_token_id: int,
audio_start_token_id: int,
position_id_per_seconds: int,
seconds_per_chunk: int = 2,
input_ids: torch.LongTensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
video_grid_thw: torch.LongTensor | None = None,
attention_mask: torch.Tensor | None = None,
use_audio_in_video: bool = False,
audio_seqlens: torch.LongTensor | None = None,
second_per_grids: torch.Tensor | None = None,
) tuple[torch.Tensor, torch.Tensor]#

Calculate the 3D RoPE index from the temporal, height, and width dimensions of images and videos in the LLM.

Ported from HF Qwen2_5OmniThinkerForConditionalGeneration.get_rope_index as a standalone function.

Key differences from Qwen3 Omni MoE rope:

  • Audio output length: ((audio_seqlens - 1) // 2 + 1 - 2) // 2 + 1

  • Token scanning: searches for image_token_id/video_token_id/audio_token_id directly

  • Has seconds_per_chunk for audio-in-video interleaving

  • Uses get_chunked_index for audio-in-video chunk interleaving
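For intuition on the 3D index itself: in a text-only span, the (temporal, height, width) positions all degenerate to the same sequential 1D position, as in Qwen2-VL-style multimodal RoPE. A minimal illustration (not the real API):

```python
import torch

# Text tokens get identical positions on all three RoPE axes;
# only vision/audio tokens receive distinct temporal/height/width indices.
seq_len = 5
text_pos = torch.arange(seq_len).view(1, -1).expand(3, -1)
print(text_pos)
# tensor([[0, 1, 2, 3, 4],
#         [0, 1, 2, 3, 4],
#         [0, 1, 2, 3, 4]])
```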