`bridge.models.gemma_vl.modeling_gemma3_vl`#

Module Contents#

Classes#

`Gemma3VLModel`	Gemma3 Vision-Language (VL) model wrapper for Megatron.
`Gemma3VLMultimodalProjectorConfig`	Gemma3 VL multimodal projector config
`Gemma3VLMultimodalProjector`	Gemma3 VL multimodal projector

API#

class bridge.models.gemma_vl.modeling_gemma3_vl.Gemma3VLModel( config: megatron.bridge.models.gpt_provider.GPTModelProvider, pre_process: bool = True, post_process: bool = True, vp_stage: Optional[int] = None, )#

Bases: megatron.core.transformer.module.MegatronModule

Gemma3 Vision-Language (VL) model wrapper for Megatron.

Parameters:

config (GPTModelProvider) – Model provider containing configuration for language and vision modules.
pre_process (bool, optional) – Whether to construct the vision tower and projector. Default: True.
post_process (bool, optional) – Whether to apply post-processing. Default: True.
vp_stage (Optional[int], optional) – Pipeline stage for model parallelism. Default: None.

Attributes:

pre_process (bool) – If True, enables vision and multimodal components.
post_process (bool) – If True, enables post-processing.
vp_stage (Optional[int]) – Pipeline stage for model parallelism.
vision_tower (nn.Module) – Vision encoder (e.g., SigLIP or other vision backbone).
multi_modal_projector (nn.Module) – Projects vision features to language model space.
language_model (nn.Module) – The underlying language model.
get_image_features (callable) – Method to extract image features, compatible with HuggingFace Gemma3Model.

Forward Inputs:

input_ids (torch.LongTensor, optional) – Tokenized input ids for the language model.
attention_mask (torch.Tensor, optional) – Attention mask for the language model.
position_ids (torch.LongTensor, optional) – Position ids for the language model.
inputs_embeds (torch.FloatTensor, optional) – Precomputed input embeddings.
pixel_values (torch.Tensor, optional) – Image tensor(s) for the vision tower.
labels (torch.Tensor, optional) – Target labels for supervised training.
runtime_gather_output (bool, optional) – If True, gather outputs across pipeline stages.
loss_mask (torch.Tensor, optional) – Mask for loss computation.

Returns: Model output (e.g., logits or loss, depending on mode).
Return type: Tensor
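The `loss_mask` input is described above only as a mask for loss computation. As a minimal sketch of one common convention (an assumption, not confirmed by this page), per-token losses can be averaged over the positions where the mask is 1. The helper `masked_mean_loss` is hypothetical and uses plain Python lists in place of tensors.

```python
from typing import Sequence


def masked_mean_loss(per_token_loss: Sequence[float], loss_mask: Sequence[int]) -> float:
    """Average per-token losses over positions where loss_mask is 1.

    Hypothetical helper for illustration; the actual reduction used by
    Gemma3VLModel is not documented on this page.
    """
    if len(per_token_loss) != len(loss_mask):
        raise ValueError("per_token_loss and loss_mask must have the same length")
    # Sum only the losses at unmasked positions.
    total = sum(loss * mask for loss, mask in zip(per_token_loss, loss_mask))
    count = sum(loss_mask)
    # Avoid dividing by zero when every position is masked out.
    return total / count if count else 0.0


# Positions 0 and 3 (for example, prompt or padding tokens) are excluded.
example_loss = masked_mean_loss([2.0, 4.0, 6.0, 8.0], [0, 1, 1, 0])
```

Here only the two middle positions contribute, so the result is the mean of 4.0 and 6.0.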

Note:

If pre_process is False, only the language model is constructed.
The vision tower and projector are only active if pre_process is True.
This class is intended for use within the Megatron-LM framework.
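The note above can be illustrated with a small sketch. It assumes the usual Megatron convention that `pre_process` is True on the first pipeline stage and `post_process` is True on the last; this page does not state that convention, and the helpers `stage_flags` and `components_for_stage` are hypothetical names used only for illustration.

```python
from typing import List, Tuple


def stage_flags(pp_rank: int, pp_size: int) -> Tuple[bool, bool]:
    """Return (pre_process, post_process) for a pipeline rank.

    Assumption: first stage pre-processes, last stage post-processes.
    """
    pre_process = pp_rank == 0
    post_process = pp_rank == pp_size - 1
    return pre_process, post_process


def components_for_stage(pre_process: bool) -> List[str]:
    """List the submodules the note says are built for a given pre_process value."""
    parts = ["language_model"]
    if pre_process:
        # Vision tower and projector are only constructed when pre_process is True.
        parts = ["vision_tower", "multi_modal_projector"] + parts
    return parts


# With 4 pipeline stages, only rank 0 would hold the vision components.
layout = {rank: components_for_stage(stage_flags(rank, 4)[0]) for rank in range(4)}
```

Under these assumptions, rank 0 holds all three submodules and every later rank holds only the language model.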

Initialization

set_input_tensor(input_tensor) → None#: Set model chunk input tensor.

forward( input_ids: torch.LongTensor = None, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, inputs_embeds: Optional[torch.FloatTensor] = None, pixel_values: Optional[torch.Tensor] = None, labels: Optional[torch.Tensor] = None, runtime_gather_output: Optional[bool] = None, *, loss_mask: Optional[torch.Tensor] = None, ) → torch.Tensor#

The docstring describes a single argument:

image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) – The temporal, height, and width dimensions of each image's feature grid in the LLM.

image_grid_thw is not part of the signature above; the inputs the method accepts are listed under Forward Inputs.

freeze( freeze_language_model: bool, freeze_vision_model: bool, freeze_vision_projection: bool, )#

Freeze model modules.

Make specific modules non-trainable by setting requires_grad to False.

Parameters:

freeze_language_model (bool) – Freeze the language model module.
freeze_vision_model (bool) – Freeze the vision model module (patch_embed and blocks).
freeze_vision_projection (bool) – Freeze the vision projection module (merger).
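As a rough sketch of what freezing by setting requires_grad to False amounts to, the following uses small stand-in classes instead of real torch modules so that it runs without any dependencies. The attribute names `language_model`, `vision_tower`, and `multi_modal_projector` come from the class attributes listed above; `FakeParam`, `FakeModule`, `FakeVLModel`, and `freeze_sketch` are hypothetical, and the mapping of each flag to a module is an assumption rather than the library's implementation.

```python
class FakeParam:
    """Stand-in for torch.nn.Parameter; only tracks requires_grad."""

    def __init__(self):
        self.requires_grad = True


class FakeModule:
    """Stand-in for torch.nn.Module exposing parameters()."""

    def __init__(self, n_params):
        self._params = [FakeParam() for _ in range(n_params)]

    def parameters(self):
        return iter(self._params)


class FakeVLModel:
    """Stand-in holding the three submodules named in the attribute list."""

    def __init__(self):
        self.language_model = FakeModule(3)
        self.vision_tower = FakeModule(2)
        self.multi_modal_projector = FakeModule(1)


def freeze_sketch(model, freeze_language_model, freeze_vision_model, freeze_vision_projection):
    """Set requires_grad = False on every parameter of the selected submodules."""
    selected = []
    if freeze_language_model:
        selected.append(getattr(model, "language_model", None))
    if freeze_vision_model:
        selected.append(getattr(model, "vision_tower", None))
    if freeze_vision_projection:
        selected.append(getattr(model, "multi_modal_projector", None))
    for module in selected:
        if module is None:
            # A submodule may be absent, e.g. vision parts when pre_process is False.
            continue
        for param in module.parameters():
            param.requires_grad = False


def trainable_count(module):
    """Count parameters that still require gradients."""
    return sum(1 for p in module.parameters() if p.requires_grad)


model = FakeVLModel()
# Train only the projector: freeze the language model and the vision tower.
freeze_sketch(model, True, True, False)
```

With real torch modules the same loop over `parameters()` applies; freezing everything except the projector is a common setup when only the vision-to-language mapping is being tuned.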

_compute_attention_mask( input_ids: torch.Tensor, ) → Tuple[torch.Tensor, torch.Tensor]#

class bridge.models.gemma_vl.modeling_gemma3_vl.Gemma3VLMultimodalProjectorConfig#

Bases: megatron.core.transformer.TransformerConfig

Gemma3 VL multimodal projector config

input_size: int = 1152

hidden_size: int = 2560

image_size: int = 896

patch_dim: int = 14

tokens_per_image: int = 256

normalization: str = 'RMSNorm'

layernorm_zero_centered_gamma: bool = True

layernorm_epsilon: float = 1e-06

num_layers: int = 1

num_attention_heads: int = 8
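The defaults for image_size, patch_dim, and tokens_per_image are related by simple arithmetic, worked through below. The sketch assumes the projector reduces a square grid of vision patches to a square grid of image tokens by pooling, as the HuggingFace Gemma3 projector does; whether this module pools in exactly the same way is not stated on this page, and `pooling_geometry` is a hypothetical helper.

```python
import math


def pooling_geometry(image_size: int = 896, patch_dim: int = 14, tokens_per_image: int = 256) -> dict:
    """Derive patch-grid and pooling sizes from the projector config defaults."""
    patches_per_side = image_size // patch_dim        # patches along one image edge
    tokens_per_side = math.isqrt(tokens_per_image)    # image tokens along one edge
    if tokens_per_side ** 2 != tokens_per_image:
        raise ValueError("tokens_per_image must be a perfect square")
    if patches_per_side % tokens_per_side != 0:
        raise ValueError("patch grid must divide evenly into the token grid")
    return {
        "patches_per_side": patches_per_side,
        "patches_per_image": patches_per_side ** 2,
        "tokens_per_side": tokens_per_side,
        # Side length of the pooling window that maps patches to tokens.
        "pool_kernel": patches_per_side // tokens_per_side,
    }


geometry = pooling_geometry()
```

With the defaults, an 896-pixel image split into 14-pixel patches gives a 64 x 64 grid (4096 patches), and 256 tokens form a 16 x 16 grid, so each token would summarize a 4 x 4 block of patches.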

configure_model() → bridge.models.gemma_vl.modeling_gemma3_vl.Gemma3VLMultimodalProjector#: Build and return the Gemma3VLMultimodalProjector module for this config.

class bridge.models.gemma_vl.modeling_gemma3_vl.Gemma3VLMultimodalProjector( config: megatron.core.transformer.TransformerConfig, )#