bridge.models.qwen_vl.modeling_qwen25_vl#

Module Contents#

Classes#

Qwen25VLModel

Qwen2.5 VL Model. (Based on GPT Transformer language model.)

Functions#

is_transformers_min_version

Check if the minimum required version of transformers is installed.

API#

bridge.models.qwen_vl.modeling_qwen25_vl.is_transformers_min_version(version)#

Check if the minimum required version of transformers is installed.
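
A minimal usage sketch. The import path follows the fully qualified module name on this page, the version string is illustrative, and a boolean return value is assumed from the description above.

from megatron.bridge.models.qwen_vl.modeling_qwen25_vl import is_transformers_min_version

# "4.49.0" is an illustrative threshold, not a documented requirement.
if not is_transformers_min_version("4.49.0"):
    raise RuntimeError("Installed transformers is older than the required minimum.")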

class bridge.models.qwen_vl.modeling_qwen25_vl.Qwen25VLModel(
config: megatron.bridge.models.gpt_provider.GPTModelProvider,
pre_process: bool = True,
post_process: bool = True,
vp_stage: Optional[int] = None,
)#

Bases: megatron.core.transformer.module.MegatronModule

Qwen2.5 VL Model. (Based on GPT Transformer language model.)

Parameters:
  • config (GPTModelProvider) – Language model provider.

  • transformer_layer_spec (ModuleSpec) – Specifies the module to use for the transformer layers.

  • vocab_size (int) – Vocabulary size.

  • max_sequence_length (int) – Maximum sequence length. This is used for the positional embeddings.

  • pre_process (bool, optional) – Include embedding layer (used with pipeline parallelism). Defaults to True.

  • post_process (bool, optional) – Include an output layer (used with pipeline parallelism). Defaults to True.

  • fp16_lm_cross_entropy (bool, optional) – If True, compute the unreduced language-model cross-entropy loss in fp16. Defaults to False.

  • parallel_output (bool, optional) – Do not gather the outputs, keep them split across tensor parallel ranks. Defaults to True.

  • share_embeddings_and_output_weights (bool, optional) – When True, input embeddings and output logit weights are shared. Defaults to False.

  • position_embedding_type (Literal['learned_absolute', 'rope'], optional) – Position embedding type. Defaults to 'learned_absolute'.

  • rotary_percent (float, optional) – Percent of rotary dimension to use for rotary position embeddings. Ignored unless position_embedding_type is 'rope'. Defaults to 1.0.

  • rotary_base (int, optional) – Base period for rotary position embeddings. Ignored unless position_embedding_type is 'rope'. Defaults to 10000.

  • rope_scaling (bool, optional) – Toggle RoPE scaling.

  • rope_scaling_factor (float) – RoPE scaling factor. Defaults to 8.

  • scatter_embedding_sequence_parallel (bool, optional) – Whether embeddings should be scattered across sequence parallel region or not. Defaults to True.

  • seq_len_interpolation_factor (Optional[float], optional) – Scale factor for linearly interpolating RoPE to longer sequences. The value must be a float larger than 1.0. Defaults to None.

  • pg_collection (ProcessGroupCollection) – Model communication process groups.

Initialization
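
A minimal construction sketch, assuming a GPTModelProvider instance named provider has already been configured for the target Qwen2.5 VL checkpoint (how it is built is outside this module); the inline comments reflect the parameter descriptions above.

from megatron.bridge.models.qwen_vl.modeling_qwen25_vl import Qwen25VLModel

# Assumption: `provider` is a fully configured GPTModelProvider (layer count,
# hidden size, vocabulary, rotary settings, and so on are set elsewhere).
model = Qwen25VLModel(
    config=provider,
    pre_process=True,   # include the embedding layer (first pipeline stage)
    post_process=True,  # include the output layer (last pipeline stage)
    vp_stage=None,      # virtual pipeline stage index, if virtual pipeline parallelism is used
)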

set_input_tensor(input_tensor) None#

Set the input tensor for this model chunk; used with pipeline parallelism to pass the previous stage's output as this stage's input.

forward(
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
pixel_values: Optional[torch.Tensor] = None,
pixel_values_videos: Optional[torch.FloatTensor] = None,
image_grid_thw: Optional[torch.LongTensor] = None,
video_grid_thw: Optional[torch.LongTensor] = None,
second_per_grid_ts: Optional[torch.Tensor] = None,
labels: torch.Tensor = None,
inference_context: megatron.core.inference.contexts.BaseInferenceContext = None,
packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
extra_block_kwargs: dict = None,
runtime_gather_output: Optional[bool] = None,
*,
inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
loss_mask: Optional[torch.Tensor] = None,
) torch.Tensor#

Parameters:
  • image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) – The temporal, height, and width of the feature grid for each image passed to the language model.

  • video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) – The temporal, height, and width of the feature grid for each video passed to the language model.

  • second_per_grid_ts (torch.Tensor of shape (num_videos,), optional) – The time interval (in seconds) for each grid along the temporal dimension in the 3D position IDs.
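
A hedged sketch of a text-plus-image forward pass. The tensors and shapes below are illustrative stand-ins for real preprocessing output (in practice pixel_values and image_grid_thw come from an upstream image processor, and input_ids must contain the model's image placeholder tokens); attention_mask, position_ids, and packed_seq_params are omitted for brevity.

import torch

# Assumption: `model` is an already-constructed Qwen25VLModel.
input_ids = torch.randint(0, 1000, (1, 128), dtype=torch.long)  # [batch, seq_len]; illustrative token ids
pixel_values = torch.randn(256, 1176)                           # flattened image patches; the second dim depends on the image processor
image_grid_thw = torch.tensor([[1, 16, 16]], dtype=torch.long)  # (num_images, 3): temporal, height, width of the feature grid
labels = input_ids.clone()                                      # targets for the language-model loss

output = model(
    input_ids=input_ids,
    pixel_values=pixel_values,
    image_grid_thw=image_grid_thw,
    labels=labels,
)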

freeze(
freeze_language_model: bool,
freeze_vision_model: bool,
freeze_vision_projection: bool,
)#

Freeze model modules.

Make specific modules non-trainable by setting requires_grad to False.

Parameters:
  • freeze_language_model (bool) – Freeze the language model module.

  • freeze_vision_model (bool) – Freeze the vision model module (patch_embed and blocks).

  • freeze_vision_projection (bool) – Freeze the vision projection module (merger).
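
A short usage sketch; which modules to freeze depends on the finetuning recipe and is purely illustrative here:

# Assumption: `model` is a constructed Qwen25VLModel. This freezes the vision
# tower and its projection while leaving the language model trainable.
model.freeze(
    freeze_language_model=False,
    freeze_vision_model=True,
    freeze_vision_projection=True,
)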