bridge.models.qwen_omni.modeling_qwen3_omni.rope
Qwen3-Omni multimodal RoPE helpers ported from the HF thinker implementation.
Module Contents
Functions
_get_feat_extract_output_lengths | Compute Qwen3-Omni thinker audio token lengths from feature lengths.
get_llm_pos_ids_for_vision | Build 3D multimodal RoPE ids for image or video features.
get_rope_index | Build multimodal RoPE ids for text, image, video, and audio token layouts.
API
- bridge.models.qwen_omni.modeling_qwen3_omni.rope._get_feat_extract_output_lengths(
- input_lengths: torch.Tensor,
Compute Qwen3-Omni thinker audio token lengths from feature lengths.
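As a rough illustration of what "audio token lengths from feature lengths" can mean, the sketch below follows the formula used in the upstream HF thinker code as best understood here. The function name `audio_token_lengths_sketch`, the 100-frame chunk size, and the 13-tokens-per-chunk constant are assumptions for illustration; check them against the module before relying on them. It uses only arithmetic operators, so it accepts plain integers (shown here) and should also work elementwise on integer tensors.

```python
def audio_token_lengths_sketch(input_lengths):
    """Hypothetical stand-in for _get_feat_extract_output_lengths.

    Maps a count of audio feature frames to a count of audio tokens.
    The constants below are assumptions, not taken from this module.
    """
    # Frames left over after splitting the input into full 100-frame chunks.
    remainder = input_lengths % 100
    # Three successive halvings (rounding up) applied to the leftover frames.
    feat_lengths = (remainder - 1) // 2 + 1
    tail_tokens = ((feat_lengths - 1) // 2 + 1 - 1) // 2 + 1
    # Each full 100-frame chunk is assumed to contribute 13 tokens.
    return tail_tokens + (input_lengths // 100) * 13


if __name__ == "__main__":
    for n in (1, 100, 250):
        print(n, audio_token_lengths_sketch(n))
```

Under these assumptions, 250 feature frames split into two full chunks (26 tokens) plus 50 leftover frames (7 tokens), for 33 tokens in total.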
- bridge.models.qwen_omni.modeling_qwen3_omni.rope.get_llm_pos_ids_for_vision(
- start_idx: int,
- vision_idx: int,
- spatial_merge_size: int,
- t_index: torch.Tensor,
- grid_hs: torch.Tensor,
- grid_ws: torch.Tensor,
Build 3D multimodal RoPE ids for image or video features.
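The 3D ids can be pictured as three parallel rows (temporal, height, width) with one column per merged vision token. The sketch below is a simplified, list-based illustration and not the module's code: the name `vision_pos_ids_sketch` is hypothetical, it takes a single grid height and width rather than indexing `grid_hs` and `grid_ws` by `vision_idx`, and it assumes the spatial grid is reduced by `spatial_merge_size` on each axis and that `start_idx` is added to all three rows.

```python
def vision_pos_ids_sketch(start_idx, t_index, grid_h, grid_w, spatial_merge_size):
    """Hypothetical list-based illustration of 3D RoPE ids for one image or video.

    Returns [t_ids, h_ids, w_ids], each with
    len(t_index) * (grid_h // spatial_merge_size) * (grid_w // spatial_merge_size) entries.
    """
    # Spatial merging shrinks the patch grid seen by the language model (assumed).
    llm_h = grid_h // spatial_merge_size
    llm_w = grid_w // spatial_merge_size

    t_ids, h_ids, w_ids = [], [], []
    # Tokens are laid out frame by frame, row by row, column by column.
    for t in t_index:
        for h in range(llm_h):
            for w in range(llm_w):
                t_ids.append(t + start_idx)
                h_ids.append(h + start_idx)
                w_ids.append(w + start_idx)
    return [t_ids, h_ids, w_ids]


if __name__ == "__main__":
    for row in vision_pos_ids_sketch(5, [0, 1], 4, 4, 2):
        print(row)
```

With a 4x4 patch grid, a merge size of 2, and two temporal indices, each row holds eight ids, and every id is offset by the starting position 5.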
- bridge.models.qwen_omni.modeling_qwen3_omni.rope._count_run(tokens: list[int], start: int, token_id: int) -> int
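This helper is undocumented, so the following is only a guess based on its name and signature: it plausibly counts how many consecutive entries equal to `token_id` appear in `tokens` beginning at `start`, which is how a run of placeholder tokens would be measured. The name `count_run_sketch` and the behavior shown are assumptions.

```python
def count_run_sketch(tokens: list[int], start: int, token_id: int) -> int:
    """Hypothetical reading of _count_run: length of the run of token_id at start."""
    end = start
    # Advance while still inside the list and still on the same token id.
    while end < len(tokens) and tokens[end] == token_id:
        end += 1
    return end - start


if __name__ == "__main__":
    # 9 stands in for a placeholder token id; the run at index 2 has length 3.
    print(count_run_sketch([1, 2, 9, 9, 9, 3], start=2, token_id=9))
```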
- bridge.models.qwen_omni.modeling_qwen3_omni.rope._next_position_start(
- llm_pos_ids_list: list[torch.Tensor],
- bridge.models.qwen_omni.modeling_qwen3_omni.rope.get_rope_index(
- spatial_merge_size: int,
- image_token_id: int,
- video_token_id: int,
- audio_token_id: int,
- vision_start_token_id: int,
- audio_start_token_id: int,
- position_id_per_seconds: int,
- input_ids: torch.LongTensor | None = None,
- image_grid_thw: torch.LongTensor | None = None,
- video_grid_thw: torch.LongTensor | None = None,
- attention_mask: torch.Tensor | None = None,
- use_audio_in_video: bool = False,
- audio_seqlens: torch.LongTensor | None = None,
- second_per_grids: torch.Tensor | None = None,
Build multimodal RoPE ids for text, image, video, and audio token layouts.
This mirrors the HF Qwen3-Omni thinker implementation so local Megatron smoke tests exercise the same placeholder ordering and audio/image position handling.
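To make the placeholder ordering concrete, here is a minimal list-based sketch of how text and vision segments might be stitched into one set of multimodal RoPE ids. It is an illustration under stated assumptions, not this module's implementation: the name `mrope_ids_sketch` and the `segments` tuple format are hypothetical, text tokens are assumed to share one id across all three axes, each new segment is assumed to start one past the largest id emitted so far, and audio tokens, padding, and the temporal scaling by `position_id_per_seconds` and `second_per_grids` are left out.

```python
def mrope_ids_sketch(segments, start=0):
    """Hypothetical illustration of stitching text and vision position ids.

    segments: list of ("text", n_tokens) or ("vision", t, h, w) tuples, where
    t, h, w are the already-merged grid sizes. Returns [t_ids, h_ids, w_ids].
    """
    t_ids, h_ids, w_ids = [], [], []
    next_start = start
    for seg in segments:
        if seg[0] == "text":
            # Text tokens advance all three axes together.
            ids = list(range(next_start, next_start + seg[1]))
            t_ids += ids
            h_ids += ids
            w_ids += ids
        else:
            _, t, h, w = seg
            # Vision tokens share a start offset but vary per axis.
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        t_ids.append(next_start + ti)
                        h_ids.append(next_start + hi)
                        w_ids.append(next_start + wi)
        if t_ids:
            # The next segment begins after the largest id on any axis (assumed).
            next_start = max(max(t_ids), max(h_ids), max(w_ids)) + 1
    return [t_ids, h_ids, w_ids]


if __name__ == "__main__":
    layout = [("text", 2), ("vision", 1, 2, 2), ("text", 1)]
    for row in mrope_ids_sketch(layout):
        print(row)
```

In this layout the trailing text token receives id 4 rather than 6: a 2x2 image consumes four sequence slots but only advances the position counter by its largest axis extent.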