bridge.models.qwen_vl.modelling_qwen3_vl.utils#

Module Contents#

Classes#

Qwen3VLVisionPatchEmbed

Vision Patch Embed for Qwen3VL vision model.

Qwen3VLVisionRotaryEmbedding

Vision Rotary Embedding for Qwen3VL vision model.

PatchMergerSubmodules

Patch Merger Submodules for Qwen3VL vision model.

Qwen3VLVisionPatchMerger

Vision Patch Merger for Qwen3VL vision model.

AllGatherVisionEmbeddings

AllGatherVisionEmbeddings for Qwen3VL vision model.

Functions#

split_part_by_cp_tp

Get the split part by CP and TP for Qwen3VL vision model, using a zigzag pattern.

split_deepstack_embs

Split the deepstack visual embeddings by CP and TP for Qwen3VL vision model.

find_vision_id_index

Find the vision id index for Qwen3VL vision model.

reorganize_inputs

Reorganize the inputs for Qwen3VL vision model.

split_data_cp_rank

Split the data by CP rank for Qwen3VL vision model, using a zigzag pattern.

expand_thw

Expand the THW (time, height, width) grid for Qwen3VL vision model.

collapse_thw

Collapse the THW (time, height, width) grid for Qwen3VL vision model.

qwen2vl_pad_and_split

Pad and split the pixel values and image grid THWs for Qwen3VL vision model.

qwen3vl_cp_split

Split the pixel values and image grid THWs across CP ranks for Qwen3VL vision model.

get_vision_cp_data

Get vision data and grid_thw for context parallelism.

preprocess_packed_seqs

Preprocess packed sequences. CP splits the sequence into CP*2 chunks, and each GPU gets 2 chunks (GPU0 gets the first and last chunks, GPU1 gets the second and second-to-last chunks, and so on); this balances load under causal masking. See https://github.com/NVIDIA/TransformerEngine/issues/1368

API#

class bridge.models.qwen_vl.modelling_qwen3_vl.utils.Qwen3VLVisionPatchEmbed(
config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
)#

Bases: torch.nn.Module

Vision Patch Embed for Qwen3VL vision model.

Initialization

forward(hidden_states: torch.Tensor) → torch.Tensor#
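For orientation, a minimal sketch of the common patch-embed pattern in the Qwen-VL family: a single 3D convolution whose kernel equals its stride turns each temporal/spatial patch into one token. The kernel sizes and attribute names below are illustrative assumptions, not the module's actual config fields:

```python
import torch
import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    """Hypothetical sketch of a vision patch embed (not the library class)."""

    def __init__(self, patch_size=14, temporal_patch_size=2, in_channels=3, hidden_size=1152):
        super().__init__()
        self.patch_size = patch_size
        self.temporal_patch_size = temporal_patch_size
        self.in_channels = in_channels
        # kernel == stride: each (t, h, w) patch maps to exactly one embedding.
        self.proj = nn.Conv3d(
            in_channels, hidden_size,
            kernel_size=(temporal_patch_size, patch_size, patch_size),
            stride=(temporal_patch_size, patch_size, patch_size),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [num_patches, in_channels * temporal_patch_size * patch_size**2]
        x = hidden_states.view(
            -1, self.in_channels, self.temporal_patch_size, self.patch_size, self.patch_size
        )
        return self.proj(x).view(-1, self.proj.out_channels)
```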
class bridge.models.qwen_vl.modelling_qwen3_vl.utils.Qwen3VLVisionRotaryEmbedding(dim: int, theta: float = 10000.0)#

Bases: torch.nn.Module

Vision Rotary Embedding for Qwen3VL vision model.

Initialization

forward(seqlen: int) → torch.Tensor#
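The signature (dim, theta, forward(seqlen)) matches the standard RoPE frequency-table pattern; a minimal sketch under that assumption:

```python
import torch
import torch.nn as nn

class VisionRotaryEmbeddingSketch(nn.Module):
    """Hypothetical sketch: per-position rotary angles for a vision tower."""

    def __init__(self, dim: int, theta: float = 10000.0):
        super().__init__()
        # Standard RoPE inverse frequencies over half the rotary dimension.
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, seqlen: int) -> torch.Tensor:
        positions = torch.arange(seqlen, dtype=self.inv_freq.dtype, device=self.inv_freq.device)
        # [seqlen, dim // 2] angle grid; callers typically take sin/cos of this.
        return torch.outer(positions, self.inv_freq)
```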
class bridge.models.qwen_vl.modelling_qwen3_vl.utils.PatchMergerSubmodules#

Patch Merger Submodules for Qwen3VL vision model.

patch_norm: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

linear_fc1: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

linear_fc2: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

class bridge.models.qwen_vl.modelling_qwen3_vl.utils.Qwen3VLVisionPatchMerger(
config: megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config.Qwen3VLTransformerConfig,
submodules: bridge.models.qwen_vl.modelling_qwen3_vl.utils.PatchMergerSubmodules,
use_postshuffle_norm=False,
tp_group: Optional[torch.distributed.ProcessGroup] = None,
)#

Bases: megatron.core.transformer.module.MegatronModule

Vision Patch Merger for Qwen3VL vision model.

Initialization

forward(hidden_states)#
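The PatchMergerSubmodules spec above (patch_norm, linear_fc1, linear_fc2) suggests a norm-then-two-layer-MLP over groups of adjacent patch embeddings. A plain-PyTorch sketch under that assumption; the merge factor and activation are guesses:

```python
import torch
import torch.nn as nn

class PatchMergerSketch(nn.Module):
    """Hypothetical sketch: concatenate merge_size**2 neighboring patch
    embeddings after normalization and project them with an MLP."""

    def __init__(self, hidden_size: int, out_size: int, merge_size: int = 2):
        super().__init__()
        merged = hidden_size * merge_size**2
        self.patch_norm = nn.LayerNorm(hidden_size)
        self.linear_fc1 = nn.Linear(merged, merged)
        self.act = nn.GELU()
        self.linear_fc2 = nn.Linear(merged, out_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [num_patches, hidden_size], num_patches divisible by merge_size**2.
        x = self.patch_norm(hidden_states)
        x = x.view(-1, self.linear_fc1.in_features)  # group adjacent patches into one row
        return self.linear_fc2(self.act(self.linear_fc1(x)))
```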
bridge.models.qwen_vl.modelling_qwen3_vl.utils.split_part_by_cp_tp(cp_size, cp_rank, tp_size, tp_rank, split_size)#

Get the split part by CP and TP for Qwen3VL vision model, using a zigzag pattern.

bridge.models.qwen_vl.modelling_qwen3_vl.utils.split_deepstack_embs(
visual_pos_masks: torch.Tensor,
deepstack_visual_embeds: list[torch.Tensor],
tp_size: int = 1,
tp_rank: int = 0,
cp_size: int = 1,
cp_rank: int = 0,
sequence_parallel: bool = False,
)#

Split the deepstack visual embeddings by CP and TP for Qwen3VL vision model.

Note: first split by CP (zigzag), then split by SP. For example, with cp=2 / tp=4, visual_pos_masks is split into 16 parts: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15].

The first split by CP (zigzag) gives:

cp_rank0: [0, 1, 2, 3, 12, 13, 14, 15]
cp_rank1: [4, 5, 6, 7, 8, 9, 10, 11]

The subsequent split by SP gives:

cp_rank0/tp_rank0 = [0, 1]
cp_rank0/tp_rank1 = [2, 3]
…
cp_rank1/tp_rank2 = [8, 9]
cp_rank1/tp_rank3 = [10, 11]
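The part indexing in this note can be reproduced with a few lines of index arithmetic. A self-contained sketch (a hypothetical helper, not the library's split_part_by_cp_tp):

```python
def zigzag_cp_then_sp_parts(cp_size: int, cp_rank: int, tp_size: int, tp_rank: int) -> list[int]:
    """Hypothetical: which of the 2 * cp_size * tp_size parts land on this rank."""
    # Zigzag CP: rank r takes chunk r and its mirror chunk (2 * cp_size - 1 - r).
    chunks = [cp_rank, 2 * cp_size - 1 - cp_rank]
    cp_parts = [c * tp_size + i for c in chunks for i in range(tp_size)]
    # Sequence parallel: each TP rank keeps an equal contiguous slice of the CP parts.
    per_tp = len(cp_parts) // tp_size
    return cp_parts[tp_rank * per_tp : (tp_rank + 1) * per_tp]

# Reproduces the cp=2 / tp=4 example above:
assert zigzag_cp_then_sp_parts(2, 0, 4, 0) == [0, 1]
assert zigzag_cp_then_sp_parts(2, 1, 4, 2) == [8, 9]
assert zigzag_cp_then_sp_parts(2, 1, 4, 3) == [10, 11]
```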

bridge.models.qwen_vl.modelling_qwen3_vl.utils.find_vision_id_index(
input_ids: torch.Tensor,
image_token_id: int,
video_token_id: int,
)#

Find the vision id index for Qwen3VL vision model.
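Without the implementation at hand, locating vision placeholder tokens is typically a mask-and-nonzero pass; a hedged sketch (the function name below is hypothetical):

```python
import torch

def find_vision_token_indices(input_ids, image_token_id, video_token_id):
    """Hypothetical sketch: indices of image/video placeholder tokens."""
    mask = (input_ids == image_token_id) | (input_ids == video_token_id)
    return mask.nonzero(as_tuple=False)

ids = torch.tensor([1, 151655, 151655, 2, 151656])
print(find_vision_token_indices(ids, 151655, 151656).flatten().tolist())  # [1, 2, 4]
```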

bridge.models.qwen_vl.modelling_qwen3_vl.utils.reorganize_inputs(
input_ids: torch.Tensor,
pixel_values: torch.Tensor = None,
pixel_values_videos: torch.Tensor = None,
image_grid_thw: torch.Tensor = None,
video_grid_thw: torch.Tensor = None,
image_input_mask: torch.Tensor = None,
video_input_mask: torch.Tensor = None,
image_token_id: int = 151655,
video_token_id: int = 151656,
square_merge_size: int = 4,
)#

Reorganize the inputs for Qwen3VL vision model.

bridge.models.qwen_vl.modelling_qwen3_vl.utils.split_data_cp_rank(
val: torch.Tensor,
cp_size: int,
seq_dim: int,
cp_rank: int = None,
)#

Split the data by CP rank for Qwen3VL vision model, using a zigzag pattern.
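A minimal sketch of the zigzag split described here, assuming the usual 2*cp_size-chunk scheme in which rank r keeps chunks r and 2*cp_size - 1 - r (see preprocess_packed_seqs below):

```python
import torch

def zigzag_split_sketch(val: torch.Tensor, cp_size: int, seq_dim: int, cp_rank: int) -> torch.Tensor:
    """Hypothetical sketch: keep this CP rank's two zigzag chunks along seq_dim."""
    chunks = val.chunk(2 * cp_size, dim=seq_dim)
    return torch.cat([chunks[cp_rank], chunks[2 * cp_size - 1 - cp_rank]], dim=seq_dim)

x = torch.arange(8)
print(zigzag_split_sketch(x, cp_size=2, seq_dim=0, cp_rank=0))  # tensor([0, 1, 6, 7])
```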

bridge.models.qwen_vl.modelling_qwen3_vl.utils.expand_thw(thw: torch.Tensor) → torch.Tensor#

Expand the THW (time, height, width) grid for Qwen3VL vision model.

bridge.models.qwen_vl.modelling_qwen3_vl.utils.collapse_thw(expanded: torch.Tensor) → torch.Tensor#

Collapse the THW (time, height, width) grid for Qwen3VL vision model.
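The names suggest expand_thw and collapse_thw are inverses that move between per-image [t, h, w] rows and per-frame [1, h, w] rows. A sketch under that assumption only; the real semantics may differ:

```python
import torch

def expand_thw_sketch(thw: torch.Tensor) -> torch.Tensor:
    # Assumed behavior: each [t, h, w] row becomes t rows of [1, h, w].
    rows = [torch.tensor([[1, h, w]] * t) for t, h, w in thw.tolist()]
    return torch.cat(rows, dim=0)

def collapse_thw_sketch(expanded: torch.Tensor) -> torch.Tensor:
    # Assumed inverse: merge consecutive rows that share (h, w) back into one row.
    out: list[list[int]] = []
    for t, h, w in expanded.tolist():
        if out and out[-1][1:] == [h, w]:
            out[-1][0] += t
        else:
            out.append([t, h, w])
    return torch.tensor(out)

grid = torch.tensor([[2, 4, 4], [1, 8, 8]])
assert torch.equal(collapse_thw_sketch(expand_thw_sketch(grid)), grid)
```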

bridge.models.qwen_vl.modelling_qwen3_vl.utils.qwen2vl_pad_and_split(
cp_size: int,
hw_factor: int,
pixel_values: list[torch.Tensor],
image_grid_thws: list[torch.Tensor],
)#

Pad and split the pixel values and image grid THWs for Qwen3VL vision model.

bridge.models.qwen_vl.modelling_qwen3_vl.utils.qwen3vl_cp_split(
cp_size: int,
pixel_values: torch.Tensor,
image_grid_thw: torch.Tensor,
)#

Split the pixel values and image grid THWs across CP ranks for Qwen3VL vision model.

bridge.models.qwen_vl.modelling_qwen3_vl.utils.get_vision_cp_data(
vision_data: torch.Tensor,
vision_grid_thw: torch.Tensor,
square_merge_size: int,
cp_img_num: list[int],
images_padded: list[bool],
cp_rank: int,
cp_size: int,
)#

Get vision data and grid_thw for context parallelism.

Returns:

vision_data (torch.Tensor): Vision data of shape [total_thw_size, n_features].

vision_grid_thw (torch.Tensor): Vision grid_thw of shape [total_thw_size, 3].

seqlens_list (list of torch.Tensor): Seqlens of the vision data on each context parallel rank, used for the all-gather after the vision encoder.

class bridge.models.qwen_vl.modelling_qwen3_vl.utils.AllGatherVisionEmbeddings#

Bases: torch.autograd.Function

AllGatherVisionEmbeddings for Qwen3VL vision model.

static forward(
ctx,
input,
seqlens_on_cp_ranks,
cp_group: torch.distributed.ProcessGroup,
)#

Forward pass for AllGatherVisionEmbeddings.

static backward(ctx, grad_output)#

Backward pass for AllGatherVisionEmbeddings.
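The usual shape of such an autograd function: all-gather the variable-length per-rank shards in forward, and slice out the local shard's gradient in backward. A hedged sketch, not the module's exact code (recent PyTorch supports uneven shard sizes in all_gather):

```python
import torch
import torch.distributed as dist

class AllGatherSketch(torch.autograd.Function):
    """Hypothetical sketch: gather per-rank vision embeddings of differing lengths."""

    @staticmethod
    def forward(ctx, input, seqlens_on_cp_ranks, cp_group):
        ctx.seqlens = [int(n) for n in seqlens_on_cp_ranks]
        ctx.rank = dist.get_rank(cp_group)
        gathered = [
            torch.empty(n, *input.shape[1:], dtype=input.dtype, device=input.device)
            for n in ctx.seqlens
        ]
        dist.all_gather(gathered, input.contiguous(), group=cp_group)
        return torch.cat(gathered, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Each rank keeps only the gradient slice matching its own input shard.
        start = sum(ctx.seqlens[: ctx.rank])
        end = start + ctx.seqlens[ctx.rank]
        # Gradients only for `input`; the other forward args are non-differentiable.
        return grad_output[start:end], None, None
```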

bridge.models.qwen_vl.modelling_qwen3_vl.utils.preprocess_packed_seqs(
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
pre_process: bool = True,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
) → tuple[torch.Tensor, megatron.core.packed_seq_params.PackedSeqParams]#

Preprocess packed sequences. CP splits the sequence into CP*2 chunks, and each GPU gets 2 chunks (GPU0 gets the first and last chunks, GPU1 gets the second and second-to-last chunks, and so on); this balances load under causal masking. See https://github.com/NVIDIA/TransformerEngine/issues/1368
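The chunk-to-GPU assignment described above reduces to one line of arithmetic; a small sketch of just that mapping (not the full preprocessing):

```python
def zigzag_chunks_for_rank(cp_rank: int, cp_size: int) -> tuple[int, int]:
    """With the sequence cut into cp_size * 2 chunks, rank r takes chunk r and its
    mirror, pairing a cheap early causal chunk with an expensive late one."""
    return cp_rank, 2 * cp_size - 1 - cp_rank

# cp_size=4: GPU0 -> (0, 7), GPU1 -> (1, 6), GPU2 -> (2, 5), GPU3 -> (3, 4)
print([zigzag_chunks_for_rank(r, 4) for r in range(4)])
```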