bridge.models.qwen_omni.modeling_qwen25_omni.rope#

Module Contents#

Functions#

_get_feat_extract_output_lengths

Computes the output length of the convolutional layers and the audio encoder for Qwen2.5-Omni.

get_llm_pos_ids_for_vision

Get LLM position IDs for vision tokens (3D: temporal, height, width).

get_chunked_index

Splits a token index list into chunks based on token value ranges.

get_rope_index

Calculate the 3D RoPE index from the temporal, height, and width dimensions of images and videos in the LLM.

API#

bridge.models.qwen_omni.modeling_qwen25_omni.rope._get_feat_extract_output_lengths(input_lengths)#

Computes the output length of the convolutional layers and the audio encoder for Qwen2.5-Omni.

Formula: feat = (input_lengths - 1) // 2 + 1, output = (feat - 2) // 2 + 1
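The formula can be sketched as a standalone function; the name and docstring below are illustrative, not the module's actual implementation:

```python
import torch

def feat_extract_output_lengths(input_lengths: torch.LongTensor) -> torch.LongTensor:
    """Sketch of the documented formula: two stride-2 downsampling stages
    (the convolutional feature extractor, then the audio encoder)."""
    feat_lengths = (input_lengths - 1) // 2 + 1    # first stride-2 stage
    output_lengths = (feat_lengths - 2) // 2 + 1   # second stride-2 stage
    return output_lengths

lengths = torch.tensor([100, 3000])
print(feat_extract_output_lengths(lengths))  # tensor([ 25, 750])
```

Each stage roughly halves the sequence length, so 3000 audio frames reduce to 750 encoder outputs.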

bridge.models.qwen_omni.modeling_qwen25_omni.rope.get_llm_pos_ids_for_vision(
start_idx: int,
vision_idx: int,
spatial_merge_size: int,
t_index: list[torch.Tensor],
grid_hs: list[torch.Tensor],
grid_ws: list[torch.Tensor],
)#

Get LLM position IDs for vision tokens (3D: temporal, height, width).
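A minimal sketch of the position-ID layout, following the Hugging Face Qwen2.5-Omni reference behavior (the function name here is illustrative): the temporal index is constant within a frame, while height and width indices enumerate the spatially merged grid, all offset by `start_idx`.

```python
import torch

def llm_pos_ids_for_vision_sketch(start_idx, vision_idx, spatial_merge_size,
                                  t_index, grid_hs, grid_ws):
    # Grid size after spatial merging: each merged token covers a
    # spatial_merge_size x spatial_merge_size patch block.
    llm_grid_h = int(grid_hs[vision_idx]) // spatial_merge_size
    llm_grid_w = int(grid_ws[vision_idx]) // spatial_merge_size
    n_t = len(t_index)
    # Height index varies over rows, repeated across frames and columns.
    h_index = torch.arange(llm_grid_h).view(1, -1, 1).expand(n_t, -1, llm_grid_w).flatten()
    # Width index varies over columns, repeated across frames and rows.
    w_index = torch.arange(llm_grid_w).view(1, 1, -1).expand(n_t, llm_grid_h, -1).flatten()
    # Temporal index is constant within each frame's h*w tokens.
    t_idx = torch.tensor(t_index).view(-1, 1).expand(-1, llm_grid_h * llm_grid_w).flatten().long()
    # Stack into (3, num_tokens) and offset so positions continue from start_idx.
    return torch.stack([t_idx, h_index, w_index]) + start_idx

pos = llm_pos_ids_for_vision_sketch(
    start_idx=10, vision_idx=0, spatial_merge_size=2,
    t_index=[0, 1], grid_hs=torch.tensor([4]), grid_ws=torch.tensor([4]))
print(pos.shape)  # torch.Size([3, 8]): 2 frames x (4//2) x (4//2) merged tokens
```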

bridge.models.qwen_omni.modeling_qwen25_omni.rope.get_chunked_index(
token_indices: torch.Tensor,
tokens_per_chunk: int,
remove_index: int,
) list[tuple[int, int]]#

Splits a token index list into chunks based on token value ranges.

Given a list of token indices, returns a list of (start, end) index tuples representing slices of the list where the token values fall within successive ranges of tokens_per_chunk.
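A sketch of that behavior on a plain list (the real function takes a `torch.Tensor`; the name below is illustrative): a new chunk is cut each time a value, shifted by `remove_index`, crosses the next `tokens_per_chunk` boundary.

```python
def chunked_index_sketch(token_indices, tokens_per_chunk, remove_index):
    # Walk the (sorted) token indices left to right; start a new chunk
    # whenever a shifted value enters the next tokens_per_chunk range.
    chunks, start, current_chunk = [], 0, 1
    for i, token in enumerate(token_indices):
        if token - remove_index >= current_chunk * tokens_per_chunk:
            chunks.append((start, i))
            start = i
            current_chunk += 1
    chunks.append((start, len(token_indices)))
    return chunks

print(chunked_index_sketch([0, 1, 5, 6, 10, 11], tokens_per_chunk=4, remove_index=0))
# -> [(0, 2), (2, 4), (4, 6)]
```

Values 0 and 1 fall in range [0, 4), 5 and 6 in [4, 8), and 10 and 11 in [8, 12), giving three (start, end) slices.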

bridge.models.qwen_omni.modeling_qwen25_omni.rope.get_rope_index(
spatial_merge_size: int,
image_token_id: int,
video_token_id: int,
audio_token_id: int,
vision_start_token_id: int,
audio_start_token_id: int,
position_id_per_seconds: int,
seconds_per_chunk: int = 2,
input_ids: torch.LongTensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
video_grid_thw: torch.LongTensor | None = None,
attention_mask: torch.Tensor | None = None,
use_audio_in_video: bool = False,
audio_seqlens: torch.LongTensor | None = None,
second_per_grids: torch.Tensor | None = None,
) tuple[torch.Tensor, torch.Tensor]#

Calculate the 3D RoPE index from the temporal, height, and width dimensions of images and videos in the LLM.

Ported from HF Qwen2_5OmniThinkerForConditionalGeneration.get_rope_index as a standalone function.

Key differences from Qwen3 Omni MoE rope:

  • Audio output length: ((audio_seqlens - 1) // 2 + 1 - 2) // 2 + 1

  • Token scanning: searches for image_token_id/video_token_id/audio_token_id directly

  • Has seconds_per_chunk for audio-in-video interleaving

  • Uses get_chunked_index for audio-in-video chunk interleaving
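For intuition on the 3D index itself: in a text-only span, the (temporal, height, width) positions all degenerate to the same sequential 1D position, as in Qwen2-VL-style multimodal RoPE. A minimal illustration (not the real API):

```python
import torch

# Text tokens get identical positions on all three RoPE axes;
# only vision/audio tokens receive distinct temporal/height/width indices.
seq_len = 5
text_pos = torch.arange(seq_len).view(1, -1).expand(3, -1)
print(text_pos)
# tensor([[0, 1, 2, 3, 4],
#         [0, 1, 2, 3, 4],
#         [0, 1, 2, 3, 4]])
```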