bridge.models.qwen_vl.modeling_qwen25_vl#
Module Contents#
Classes#
- Qwen25VLModel: Qwen2.5 VL Model. (Based on GPT Transformer language model.)
Functions#
- is_transformers_min_version: Check if minimum version of transformers is installed.
API#
- bridge.models.qwen_vl.modeling_qwen25_vl.is_transformers_min_version(version)#
Check if minimum version of transformers is installed.
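The implementation is not shown on this page; a minimal sketch of such a version gate, assuming it compares the installed `transformers` version string against a required minimum (the `meets_min_version` helper and its parsing below are illustrative, not this module's API):

```python
def meets_min_version(installed: str, required: str) -> bool:
    """Return True if `installed` is at least `required`.

    Illustrative only: compares up to three numeric dotted components and
    ignores pre-release suffixes (the real check may use packaging.version).
    """
    def parse(v: str) -> tuple:
        nums = []
        for part in v.split(".")[:3]:
            digits = "".join(ch for ch in part if ch.isdigit())
            nums.append(int(digits) if digits else 0)
        while len(nums) < 3:
            nums.append(0)  # pad so "4.45" compares like "4.45.0"
        return tuple(nums)

    return parse(installed) >= parse(required)
```

In the real function, `installed` would come from `transformers.__version__`.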
- class bridge.models.qwen_vl.modeling_qwen25_vl.Qwen25VLModel(
- config: megatron.bridge.models.gpt_provider.GPTModelProvider,
- pre_process: bool = True,
- post_process: bool = True,
- vp_stage: Optional[int] = None,
)#
Bases:
megatron.core.transformer.module.MegatronModule

Qwen2.5 VL Model. (Based on GPT Transformer language model.)
- Parameters:
config (GPTModelProvider) – language model provider.
transformer_layer_spec (ModuleSpec) – Specifies module to use for transformer layers
vocab_size (int) – Vocabulary size
max_sequence_length (int) – maximum size of sequence. This is used for positional embedding
pre_process (bool, optional) – Include embedding layer (used with pipeline parallelism). Defaults to True.
post_process (bool, optional) – Include an output layer (used with pipeline parallelism). Defaults to True.
fp16_lm_cross_entropy (bool, optional) – Defaults to False.
parallel_output (bool, optional) – Do not gather the outputs, keep them split across tensor parallel ranks. Defaults to True.
share_embeddings_and_output_weights (bool, optional) – When True, input embeddings and output logit weights are shared. Defaults to False.
position_embedding_type (Literal['learned_absolute', 'rope'], optional) – Position embedding type. Defaults to ‘learned_absolute’.
rotary_percent (float, optional) – Percent of rotary dimension to use for rotary position embeddings. Ignored unless position_embedding_type is ‘rope’. Defaults to 1.0.
rotary_base (int, optional) – Base period for rotary position embeddings. Ignored unless position_embedding_type is ‘rope’. Defaults to 10000.
rope_scaling (bool, optional) – Toggle RoPE scaling.
rope_scaling_factor (float) – RoPE scaling factor. Default 8.
scatter_embedding_sequence_parallel (bool, optional) – Whether embeddings should be scattered across sequence parallel region or not. Defaults to True.
seq_len_interpolation_factor (Optional[float], optional) – scale of linearly interpolating RoPE for longer sequences. The value must be a float larger than 1.0. Defaults to None.
pg_collection (ProcessGroupCollection) – Model communication process groups
Initialization
- property decoder#
Expose language model decoder for mcore inference compatibility.
mcore’s MambaInferenceStateConfig.from_model() calls get_attr_wrapped_model(model, “decoder”), which only traverses .module wrappers. VLM models store the decoder under language_model.decoder, so we expose it here to allow the Mamba check to run and correctly return None.
- set_input_tensor(input_tensor) -> None#
Set model chunk input tensor.
- forward(
- input_ids: torch.LongTensor = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- pixel_values: Optional[torch.Tensor] = None,
- pixel_values_videos: Optional[torch.FloatTensor] = None,
- image_grid_thw: Optional[torch.LongTensor] = None,
- video_grid_thw: Optional[torch.LongTensor] = None,
- second_per_grid_ts: Optional[torch.Tensor] = None,
- labels: torch.Tensor = None,
- inference_context: megatron.core.inference.contexts.BaseInferenceContext = None,
- packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
- extra_block_kwargs: dict = None,
- runtime_gather_output: Optional[bool] = None,
- *,
- inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
- loss_mask: Optional[torch.Tensor] = None,
)#
- Parameters:
image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) – The temporal, height and width of feature shape of each image in LLM.
video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) – The temporal, height and width of feature shape of each video in LLM.
second_per_grid_ts (torch.Tensor of shape (num_videos,), optional) – The time interval (in seconds) for each grid along the temporal dimension in the 3D position IDs.
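As a rough illustration of how `image_grid_thw` relates to the number of vision tokens the language model receives: in Qwen2.5-VL, vision patches are merged spatially (by a factor of 2 per side by default) before being handed to the LLM, so the per-image token count can be sketched as below. The helper name and the default merge factor are assumptions for illustration, not part of this module.

```python
def image_token_count(grid_thw, spatial_merge_size: int = 2) -> int:
    """Tokens the LLM sees for one image, given its (t, h, w) patch grid.

    grid_thw is one row of image_grid_thw: temporal, height, and width of the
    patch grid. Spatial merging reduces the count by spatial_merge_size ** 2.
    """
    t, h, w = grid_thw
    return (t * h * w) // (spatial_merge_size ** 2)
```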
- freeze(
- freeze_language_model: bool,
- freeze_vision_model: bool,
- freeze_vision_projection: bool,
)#
Freeze model modules.
Make specific modules non-trainable by setting requires_grad to False.
- Parameters:
freeze_language_model (bool) – Freeze the language model module.
freeze_vision_model (bool) – Freeze the vision model module (patch_embed and blocks).
freeze_vision_projection (bool) – Freeze the vision projection module (merger).