bridge.models.qwen_vl.modeling_qwen25_vl#
Module Contents#
Classes#
- Qwen25VLModel: Qwen2.5 VL Model. (Based on GPT Transformer language model.)
Functions#
- is_transformers_min_version: Check if minimum version of transformers is installed.
API#
- bridge.models.qwen_vl.modeling_qwen25_vl.is_transformers_min_version(version)#
Check if minimum version of transformers is installed.
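The implementation is not shown on this page; a minimal sketch of such a version gate, assuming it compares the installed `transformers` version string against a required minimum (the `meets_min_version` helper and its parsing below are illustrative, not this module's API):

```python
def meets_min_version(installed: str, required: str) -> bool:
    """Return True if `installed` is at least `required`.

    Illustrative only: compares up to three numeric dotted components and
    ignores pre-release suffixes (the real check may use packaging.version).
    """
    def parse(v: str) -> tuple:
        nums = []
        for part in v.split(".")[:3]:
            digits = "".join(ch for ch in part if ch.isdigit())
            nums.append(int(digits) if digits else 0)
        while len(nums) < 3:
            nums.append(0)  # pad so "4.45" compares like "4.45.0"
        return tuple(nums)

    return parse(installed) >= parse(required)
```

In the real function, `installed` would come from `transformers.__version__`.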
- class bridge.models.qwen_vl.modeling_qwen25_vl.Qwen25VLModel(
- config: megatron.bridge.models.gpt_provider.GPTModelProvider,
- pre_process: bool = True,
- post_process: bool = True,
- vp_stage: Optional[int] = None,
)#
Bases:
megatron.core.transformer.module.MegatronModule

Qwen2.5 VL Model. (Based on GPT Transformer language model.)
- Parameters:
config (GPTModelProvider) – language model provider.
transformer_layer_spec (ModuleSpec) – Specifies module to use for transformer layers
vocab_size (int) – Vocabulary size
max_sequence_length (int) – maximum size of sequence. This is used for positional embedding
pre_process (bool, optional) – Include embedding layer (used with pipeline parallelism). Defaults to True.
post_process (bool, optional) – Include an output layer (used with pipeline parallelism). Defaults to True.
fp16_lm_cross_entropy (bool, optional) – Defaults to False.
parallel_output (bool, optional) – Do not gather the outputs, keep them split across tensor parallel ranks. Defaults to True.
share_embeddings_and_output_weights (bool, optional) – When True, input embeddings and output logit weights are shared. Defaults to False.
position_embedding_type (Literal['learned_absolute', 'rope'], optional) – Position embedding type. Defaults to ‘learned_absolute’.
rotary_percent (float, optional) – Percent of rotary dimension to use for rotary position embeddings. Ignored unless position_embedding_type is ‘rope’. Defaults to 1.0.
rotary_base (int, optional) – Base period for rotary position embeddings. Ignored unless position_embedding_type is ‘rope’. Defaults to 10000.
rope_scaling (bool, optional) – Toggle RoPE scaling.
rope_scaling_factor (float) – RoPE scaling factor. Default 8.
scatter_embedding_sequence_parallel (bool, optional) – Whether embeddings should be scattered across sequence parallel region or not. Defaults to True.
seq_len_interpolation_factor (Optional[float], optional) – scale of linearly interpolating RoPE for longer sequences. The value must be a float larger than 1.0. Defaults to None.
pg_collection (ProcessGroupCollection) – Model communication process groups
Initialization
- property decoder#
Expose language model decoder for mcore inference compatibility.
mcore’s MambaInferenceStateConfig.from_model() calls get_attr_wrapped_model(model, “decoder”), which only traverses .module wrappers. VLM models store the decoder under language_model.decoder, so we expose it here to allow the Mamba check to run and correctly return None.
- set_input_tensor(input_tensor) -> None#
Set model chunk input tensor.
- forward(
- input_ids: torch.LongTensor = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- pixel_values: Optional[torch.Tensor] = None,
- pixel_values_videos: Optional[torch.FloatTensor] = None,
- image_grid_thw: Optional[torch.LongTensor] = None,
- video_grid_thw: Optional[torch.LongTensor] = None,
- second_per_grid_ts: Optional[torch.Tensor] = None,
- labels: torch.Tensor = None,
- inference_context: megatron.core.inference.contexts.BaseInferenceContext = None,
- packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
- extra_block_kwargs: dict = None,
- runtime_gather_output: Optional[bool] = None,
- *,
- inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
- loss_mask: Optional[torch.Tensor] = None,
)#
- Parameters:
image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) – The temporal, height and width of feature shape of each image in LLM.
video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) – The temporal, height and width of feature shape of each video in LLM.
second_per_grid_ts (torch.Tensor of shape (num_videos,), optional) – The time interval (in seconds) for each grid along the temporal dimension in the 3D position IDs.
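As a rough illustration of how `image_grid_thw` relates to the number of vision tokens the language model receives: in Qwen2.5-VL, vision patches are merged spatially (by a factor of 2 per side by default) before being handed to the LLM, so the per-image token count can be sketched as below. The helper name and the default merge factor are assumptions for illustration, not part of this module.

```python
def image_token_count(grid_thw, spatial_merge_size: int = 2) -> int:
    """Tokens the LLM sees for one image, given its (t, h, w) patch grid.

    grid_thw is one row of image_grid_thw: temporal, height, and width of the
    patch grid. Spatial merging reduces the count by spatial_merge_size ** 2.
    """
    t, h, w = grid_thw
    return (t * h * w) // (spatial_merge_size ** 2)
```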
- freeze(
- freeze_language_model: bool,
- freeze_vision_model: bool,
- freeze_vision_projection: bool,
)#
Freeze model modules.
Make specific modules non-trainable by setting requires_grad to False.
- Parameters:
freeze_language_model (bool) – Freeze the language model module.
freeze_vision_model (bool) – Freeze the vision model module (patch_embed and blocks).
freeze_vision_projection (bool) – Freeze the vision projection module (merger).