bridge.models.qwen_omni.modeling_qwen25_omni.rope
Module Contents
Functions
| _get_feat_extract_output_lengths | Computes the output length of the convolutional layers and the audio encoder for Qwen2.5-Omni. |
| get_llm_pos_ids_for_vision | Get LLM position IDs for vision tokens (3D: temporal, height, width). |
| get_chunked_index | Splits token index list into chunks based on token value ranges. |
| get_rope_index | Calculates the 3D RoPE position index for the LLM from the temporal, height, and width dimensions of image and video inputs. |
API
- bridge.models.qwen_omni.modeling_qwen25_omni.rope._get_feat_extract_output_lengths(input_lengths)
Computes the output length of the convolutional layers and the audio encoder for Qwen2.5-Omni.
Formula: feat = (input_lengths - 1) // 2 + 1, output = (feat - 2) // 2 + 1
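As a quick check of the formula, here is a minimal sketch in plain Python. It works on integers rather than tensors, and the name output_length_sketch is hypothetical; it is not the module's function.

```python
def output_length_sketch(input_length: int) -> int:
    """Apply the two-stage length formula from the docstring to one integer."""
    # First stage: feat = (input_lengths - 1) // 2 + 1
    feat = (input_length - 1) // 2 + 1
    # Second stage: output = (feat - 2) // 2 + 1
    return (feat - 2) // 2 + 1


# 3000 input frames -> 1500 after the first stage -> 750 after the second.
print(output_length_sketch(3000))  # 750
print(output_length_sketch(100))   # 25
```

Both stages roughly halve the length, so the output is about one quarter of the input length.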
- bridge.models.qwen_omni.modeling_qwen25_omni.rope.get_llm_pos_ids_for_vision(
- start_idx: int,
- vision_idx: int,
- spatial_merge_size: int,
- t_index: list[torch.Tensor],
- grid_hs: list[torch.Tensor],
- grid_ws: list[torch.Tensor],
- )
Get LLM position IDs for vision tokens (3D: temporal, height, width).
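The docstring does not spell out the layout of the returned IDs, so the following is only a sketch under assumptions: the grid height and width are divided by spatial_merge_size, every temporal index is repeated across the merged height x width grid, and start_idx is added on all three axes. It uses plain lists and scalar grid sizes instead of tensors and a vision_idx lookup, and the name vision_pos_ids_sketch is hypothetical.

```python
def vision_pos_ids_sketch(start_idx, t_index, grid_h, grid_w, spatial_merge_size):
    """Return [t_ids, h_ids, w_ids], one position per merged vision token."""
    # Assumption: spatial merging shrinks each grid side by spatial_merge_size.
    llm_h = grid_h // spatial_merge_size
    llm_w = grid_w // spatial_merge_size

    t_ids, h_ids, w_ids = [], [], []
    for t in t_index:              # one temporal index per frame
        for h in range(llm_h):     # rows of the merged grid
            for w in range(llm_w): # columns of the merged grid
                # Assumption: start_idx offsets all three axes equally.
                t_ids.append(t + start_idx)
                h_ids.append(h + start_idx)
                w_ids.append(w + start_idx)
    return [t_ids, h_ids, w_ids]


# Two frames on a 4x4 patch grid merged 2x2 give 2 * 2 * 2 = 8 tokens.
pos = vision_pos_ids_sketch(start_idx=10, t_index=[0, 1], grid_h=4, grid_w=4,
                            spatial_merge_size=2)
print(pos[0])  # [10, 10, 10, 10, 11, 11, 11, 11]
print(pos[1])  # [10, 10, 11, 11, 10, 10, 11, 11]
print(pos[2])  # [10, 11, 10, 11, 10, 11, 10, 11]
```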
- bridge.models.qwen_omni.modeling_qwen25_omni.rope.get_chunked_index(
- token_indices: torch.Tensor,
- tokens_per_chunk: int,
- remove_index: int,
- )
Splits token index list into chunks based on token value ranges.
Given a list of token indices, returns a list of (start, end) index tuples representing slices of the list where the token values fall within successive ranges of tokens_per_chunk.
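The chunking described above can be sketched in plain Python. This is an illustration rather than the module's code: it assumes remove_index is an offset subtracted from each token value before comparison, and that the boundary advances one range at a time. The name chunked_index_sketch is hypothetical.

```python
def chunked_index_sketch(token_indices, tokens_per_chunk, remove_index):
    """Return (start, end) slices of token_indices, one per value range."""
    chunks = []
    start = 0
    current_chunk = 1
    for i, value in enumerate(token_indices):
        # Assumption: values are shifted by remove_index before comparing
        # against the upper bound of the current range.
        if value - remove_index >= current_chunk * tokens_per_chunk:
            chunks.append((start, i))
            start = i
            current_chunk += 1
    # The last slice always runs to the end of the list.
    chunks.append((start, len(token_indices)))
    return chunks


# Values 0..3 fall in the first range, 4..7 in the second, 8..11 in the third.
print(chunked_index_sketch([0, 1, 2, 5, 6, 10, 11], tokens_per_chunk=4, remove_index=0))
# [(0, 3), (3, 5), (5, 7)]
```

Each tuple is a half-open slice, so token_indices[start:end] yields the values belonging to that chunk.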
- bridge.models.qwen_omni.modeling_qwen25_omni.rope.get_rope_index(
- spatial_merge_size: int,
- image_token_id: int,
- video_token_id: int,
- audio_token_id: int,
- vision_start_token_id: int,
- audio_start_token_id: int,
- position_id_per_seconds: int,
- seconds_per_chunk: int = 2,
- input_ids: torch.LongTensor | None = None,
- image_grid_thw: torch.LongTensor | None = None,
- video_grid_thw: torch.LongTensor | None = None,
- attention_mask: torch.Tensor | None = None,
- use_audio_in_video: bool = False,
- audio_seqlens: torch.LongTensor | None = None,
- second_per_grids: torch.Tensor | None = None,
- )
Calculates the 3D RoPE position index for the LLM from the temporal, height, and width dimensions of image and video inputs.
Ported from HF Qwen2_5OmniThinkerForConditionalGeneration.get_rope_index as a standalone function.
Key differences from Qwen3 Omni MoE rope:
- Audio output length: ((audio_seqlens - 1) // 2 + 1 - 2) // 2 + 1
- Token scanning: searches for image_token_id/video_token_id/audio_token_id directly
- Has a seconds_per_chunk parameter for audio-in-video interleaving
- Uses get_chunked_index for audio-in-video chunk interleaving
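To make the audio-in-video interleaving more concrete, here is a rough sketch of the idea. It rests on several assumptions that the text above does not confirm: that the chunk width is position_id_per_seconds * seconds_per_chunk, that positions are grouped by which time window they fall into, and that each video chunk is emitted before the audio chunk of the same window. The helper names are hypothetical, and the grouping uses itertools.groupby rather than get_chunked_index.

```python
from itertools import groupby, zip_longest


def split_by_window(positions, tokens_per_chunk):
    """Group consecutive temporal positions that share a time window."""
    return [list(g) for _, g in groupby(positions, key=lambda p: p // tokens_per_chunk)]


def interleave_sketch(video_t, audio_t, position_id_per_seconds, seconds_per_chunk=2):
    """Return (modality, position) pairs, alternating video and audio chunks."""
    # Assumption: one chunk covers seconds_per_chunk seconds of position IDs.
    tokens_per_chunk = position_id_per_seconds * seconds_per_chunk
    video_chunks = split_by_window(video_t, tokens_per_chunk)
    audio_chunks = split_by_window(audio_t, tokens_per_chunk)

    order = []
    # Assumption: video tokens of a window come first, then its audio tokens.
    for v_chunk, a_chunk in zip_longest(video_chunks, audio_chunks, fillvalue=[]):
        order.extend(("video", p) for p in v_chunk)
        order.extend(("audio", p) for p in a_chunk)
    return order


# 25 position IDs per second and 2-second chunks give windows of width 50.
order = interleave_sketch(video_t=[0, 0, 50, 50], audio_t=[0, 25, 50, 75],
                          position_id_per_seconds=25)
print([m for m, _ in order])
# ['video', 'video', 'audio', 'audio', 'video', 'video', 'audio', 'audio']
```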