bridge.models.qwen_vl.modelling_qwen3_vl.utils#
Module Contents#
Classes#
Qwen3VLVisionPatchEmbed: Vision Patch Embed for Qwen3VL vision model.
Qwen3VLVisionRotaryEmbedding: Vision Rotary Embedding for Qwen3VL vision model.
PatchMergerSubmodules: Patch Merger Submodules for Qwen3VL vision model.
Qwen3VLVisionPatchMerger: Vision Patch Merger for Qwen3VL vision model.
AllGatherVisionEmbeddings: AllGatherVisionEmbeddings for Qwen3VL vision model.
Functions#
split_part_by_cp_tp: Get the split part by CP and TP for Qwen3VL vision model using zigzag pattern.
split_deepstack_embs: Split the deepstack visual embeddings by CP and TP for Qwen3VL vision model.
find_vision_id_index: Find the vision id index for Qwen3VL vision model.
reorganize_inputs: Reorganize the inputs for Qwen3VL vision model.
split_data_cp_rank: Split the data by CP rank for Qwen3VL vision model, using zigzag pattern.
expand_thw: Expand the THW for Qwen3VL vision model.
collapse_thw: Collapse the THW for Qwen3VL vision model.
qwen2vl_pad_and_split: Pad and split the pixel values and image grid thws for Qwen3VL vision model.
qwen3vl_cp_split: Split the pixel values and image grid thws for Qwen3VL vision model.
get_vision_cp_data: Get vision data and grid_thw for context parallelism.
preprocess_packed_seqs: Preprocess packed sequences; CP splits the sequence into CP*2 chunks for load balancing with causal masking.
API#
- class bridge.models.qwen_vl.modelling_qwen3_vl.utils.Qwen3VLVisionPatchEmbed(
- config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
- )#
Bases: torch.nn.Module
Vision Patch Embed for Qwen3VL vision model.
Initialization
- forward(hidden_states: torch.Tensor) → torch.Tensor#
- class bridge.models.qwen_vl.modelling_qwen3_vl.utils.Qwen3VLVisionRotaryEmbedding(dim: int, theta: float = 10000.0)#
Bases: torch.nn.Module
Vision Rotary Embedding for Qwen3VL vision model.
Initialization
- forward(seqlen: int) → torch.Tensor#
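The table returned by a vision rotary embedding is commonly built from inverse frequencies. The sketch below is illustrative only and assumes the standard construction (`inv_freq[i] = 1 / theta**(2i/dim)`, outer product with positions); `rotary_freqs` is a hypothetical helper, not this module's `forward(seqlen)`, which may differ in detail:

```python
def rotary_freqs(seqlen: int, dim: int, theta: float = 10000.0) -> list[list[float]]:
    """Sketch of a rotary frequency table: one row per position,
    one column per frequency band (dim // 2 bands)."""
    # inverse frequencies: 1 / theta^(i/dim) for even i in [0, dim)
    inv_freq = [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]
    # angle at position `pos` for band `f` is pos * inv_freq[f]
    return [[pos * f for f in inv_freq] for pos in range(seqlen)]
```

Position 0 always yields zero angles, and position 1 reproduces the inverse frequencies themselves.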
- class bridge.models.qwen_vl.modelling_qwen3_vl.utils.PatchMergerSubmodules#
Patch Merger Submodules for Qwen3VL vision model.
- patch_norm: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- linear_fc1: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- linear_fc2: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- class bridge.models.qwen_vl.modelling_qwen3_vl.utils.Qwen3VLVisionPatchMerger(
- config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
- submodules: bridge.models.qwen_vl.modelling_qwen3_vl.utils.PatchMergerSubmodules,
- use_postshuffle_norm=False,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
- )#
Bases: megatron.core.transformer.module.MegatronModule
Vision Patch Merger for Qwen3VL vision model.
Initialization
- forward(hidden_states)#
- bridge.models.qwen_vl.modelling_qwen3_vl.utils.split_part_by_cp_tp(cp_size, cp_rank, tp_size, tp_rank, split_size)#
Get the split part by CP and TP for Qwen3VL vision model using zigzag pattern.
- bridge.models.qwen_vl.modelling_qwen3_vl.utils.split_deepstack_embs(
- visual_pos_masks: torch.Tensor,
- deepstack_visual_embeds: list[torch.Tensor],
- tp_size: int = 1,
- tp_rank: int = 0,
- cp_size: int = 1,
- cp_rank: int = 0,
- sequence_parallel: bool = False,
- )#
Split the deepstack visual embeddings by CP and TP for Qwen3VL vision model.

.. note::
   First split by CP (zigzag), then split by SP. For example, with cp=2 and tp=4, visual_pos_masks is split into 16 parts: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].
   First split by CP (zigzag):
   cp_rank0: [0, 1, 2, 3, 12, 13, 14, 15]
   cp_rank1: [4, 5, 6, 7, 8, 9, 10, 11]
   Then split by SP:
   cp_rank0/tp_rank0 = [0, 1]
   cp_rank0/tp_rank1 = [2, 3]
   …
   cp_rank1/tp_rank2 = [8, 9]
   cp_rank1/tp_rank3 = [10, 11]
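The part assignment in the note above can be sketched as a small index computation. This is illustrative only: `parts_for_rank` is a hypothetical helper that mirrors the documented example, not the library's `split_part_by_cp_tp` or `split_deepstack_embs`:

```python
def parts_for_rank(cp_size: int, cp_rank: int, tp_size: int, tp_rank: int,
                   split_size: int) -> list[int]:
    """Which of `split_size` equal parts one (cp_rank, tp_rank) owns.
    CP split is zigzag: rank r takes chunks r and 2*cp_size-1-r of
    2*cp_size equal chunks; the TP/SP split then cuts that shard evenly."""
    chunk = split_size // (2 * cp_size)
    first = list(range(cp_rank * chunk, (cp_rank + 1) * chunk))
    mirror = 2 * cp_size - 1 - cp_rank          # mirrored chunk index
    last = list(range(mirror * chunk, (mirror + 1) * chunk))
    cp_parts = first + last
    per_tp = len(cp_parts) // tp_size
    return cp_parts[tp_rank * per_tp:(tp_rank + 1) * per_tp]
```

With cp=2 and tp=4 over 16 parts this reproduces the values in the note, e.g. cp_rank0/tp_rank0 owns [0, 1] and cp_rank1/tp_rank3 owns [10, 11].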
- bridge.models.qwen_vl.modelling_qwen3_vl.utils.find_vision_id_index(
- input_ids: torch.Tensor,
- image_token_id: int,
- video_token_id: int,
- )#
Find the vision id index for Qwen3VL vision model.
- bridge.models.qwen_vl.modelling_qwen3_vl.utils.reorganize_inputs(
- input_ids: torch.Tensor,
- pixel_values: torch.Tensor = None,
- pixel_values_videos: torch.Tensor = None,
- image_grid_thw: torch.Tensor = None,
- video_grid_thw: torch.Tensor = None,
- image_input_mask: torch.Tensor = None,
- video_input_mask: torch.Tensor = None,
- image_token_id: int = 151655,
- video_token_id: int = 151656,
- square_merge_size: int = 4,
- )#
Reorganize the inputs for Qwen3VL vision model.
- bridge.models.qwen_vl.modelling_qwen3_vl.utils.split_data_cp_rank(
- val: torch.Tensor,
- cp_size: int,
- seq_dim: int,
- cp_rank: int = None,
- )#
Split the data by CP rank for Qwen3VL vision model, using zigzag pattern.
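On plain Python lists (the real function slices a `torch.Tensor` along `seq_dim`), the zigzag split can be sketched as follows; `zigzag_split` is an illustrative stand-in, not the library function:

```python
def zigzag_split(seq: list, cp_size: int, cp_rank: int) -> list:
    """Zigzag CP split: cut the sequence into 2*cp_size equal chunks
    and give rank r chunks r and 2*cp_size-1-r (first+last, etc.)."""
    n = len(seq) // (2 * cp_size)
    chunks = [seq[i * n:(i + 1) * n] for i in range(2 * cp_size)]
    return chunks[cp_rank] + chunks[2 * cp_size - 1 - cp_rank]
```

For an 8-token sequence with cp_size=2, rank 0 keeps tokens [0, 1, 6, 7] and rank 1 keeps [2, 3, 4, 5].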
- bridge.models.qwen_vl.modelling_qwen3_vl.utils.expand_thw(thw: torch.Tensor) → torch.Tensor#
Expand the THW for Qwen3VL vision model.
- bridge.models.qwen_vl.modelling_qwen3_vl.utils.collapse_thw(expanded: torch.Tensor) → torch.Tensor#
Collapse the THW for Qwen3VL vision model.
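One plausible reading of this pair, sketched on Python tuples rather than tensors: expanding turns each (t, h, w) grid row into t per-frame rows of (1, h, w), and collapsing merges consecutive rows with matching (h, w) back together. This interpretation is an assumption inferred from the names, not confirmed by the source:

```python
def expand_thw(thw_rows: list[tuple[int, int, int]]) -> list[tuple[int, int, int]]:
    """ASSUMED semantics: one (t, h, w) row becomes t rows of (1, h, w)."""
    return [(1, h, w) for (t, h, w) in thw_rows for _ in range(t)]

def collapse_thw(expanded: list[tuple[int, int, int]]) -> list[tuple[int, int, int]]:
    """ASSUMED inverse: merge consecutive rows sharing (h, w) by summing t.
    Note this would also merge adjacent distinct images of identical size."""
    out: list[tuple[int, int, int]] = []
    for (t, h, w) in expanded:
        if out and out[-1][1:] == (h, w):
            out[-1] = (out[-1][0] + t, h, w)
        else:
            out.append((t, h, w))
    return out
```

Under this reading, expand followed by collapse is a round trip for a single video grid.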
- bridge.models.qwen_vl.modelling_qwen3_vl.utils.qwen2vl_pad_and_split(
- cp_size: int,
- hw_factor: int,
- pixel_values: list[torch.Tensor],
- image_grid_thws: list[torch.Tensor],
- )#
Pad and split the pixel values and image grid thws for Qwen3VL vision model.
- bridge.models.qwen_vl.modelling_qwen3_vl.utils.qwen3vl_cp_split(
- cp_size: int,
- pixel_values: torch.Tensor,
- image_grid_thw: torch.Tensor,
- )#
Split the pixel values and image grid thws for Qwen3VL vision model.
- bridge.models.qwen_vl.modelling_qwen3_vl.utils.get_vision_cp_data(
- vision_data: torch.Tensor,
- vision_grid_thw: torch.Tensor,
- square_merge_size: int,
- cp_img_num: list[int],
- images_padded: list[bool],
- cp_rank: int,
- cp_size: int,
- )#
Get vision data and grid_thw for context parallelism.
- Returns:
  vision_data (torch.Tensor): Vision data of shape [total_thw_size, n_features].
  vision_grid_thw (torch.Tensor): Vision grid_thw of shape [total_thw_size, 3].
  seqlens_list (list of torch.Tensor): List of seqlens of the vision data on each context parallel rank, for the all-gather after the vision encoder.
- class bridge.models.qwen_vl.modelling_qwen3_vl.utils.AllGatherVisionEmbeddings#
Bases: torch.autograd.Function
AllGatherVisionEmbeddings for Qwen3VL vision model.
- static forward(
- ctx,
- input,
- seqlens_on_cp_ranks,
- cp_group: torch.distributed.ProcessGroup,
- )#
Forward pass for AllGatherVisionEmbeddings.
- static backward(ctx, grad_output)#
Backward pass for AllGatherVisionEmbeddings.
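The core of the backward pass for a variable-length all-gather can be sketched without a process group: since forward concatenates each rank's embeddings in rank order, backward must return only the gradient slice belonging to the local shard. This is a simplified sketch on plain lists (`slice_local_grad` is a hypothetical helper, not the actual `backward`):

```python
def slice_local_grad(grad_output: list, seqlens_on_cp_ranks: list[int],
                     cp_rank: int) -> list:
    """Select this rank's portion of the gathered gradient.
    Ranks' shards are laid out back-to-back, so the local slice starts
    at the sum of all preceding ranks' sequence lengths."""
    start = sum(seqlens_on_cp_ranks[:cp_rank])
    return grad_output[start:start + seqlens_on_cp_ranks[cp_rank]]
```

In the real autograd function the forward would also stash `seqlens_on_cp_ranks` and the rank on `ctx` so the backward can perform exactly this slicing on the gradient tensor.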
- bridge.models.qwen_vl.modelling_qwen3_vl.utils.preprocess_packed_seqs(
- input_ids: torch.Tensor,
- attention_mask: torch.Tensor,
- pre_process: bool = True,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
- )#
Preprocess packed sequences. CP splits the sequence into CP*2 chunks, and each GPU gets 2 chunks (GPU0 gets the first and last chunks, GPU1 gets the second and second-to-last chunks, and so on); this balances load under causal masking. See https://github.com/NVIDIA/TransformerEngine/issues/1368
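The chunk-to-GPU pairing described above can be written down directly. A minimal sketch (`chunk_assignment` is a hypothetical helper used only to illustrate the mapping): each GPU pairs an early chunk, whose tokens attend to few predecessors under causal masking, with a late chunk, whose tokens attend to many, so per-GPU attention cost evens out:

```python
def chunk_assignment(cp_size: int) -> dict[int, tuple[int, int]]:
    """Map CP rank -> its two chunks out of cp_size*2 total chunks.
    Rank r gets chunk r (cheap, early) and chunk 2*cp_size-1-r
    (expensive, late), balancing causal-attention work."""
    return {r: (r, 2 * cp_size - 1 - r) for r in range(cp_size)}
```

For cp_size=2 this yields GPU0 → chunks (0, 3) and GPU1 → chunks (1, 2), matching the docstring.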