bridge.models.kimi_vl.modeling_kimi_k25_vl#

Module Contents#

Classes#

KimiK25VLModel

Kimi K2.5 Vision-Language (VL) model wrapper for Megatron.

Data#

API#

bridge.models.kimi_vl.modeling_kimi_k25_vl.logger#

'getLogger(…)'

class bridge.models.kimi_vl.modeling_kimi_k25_vl.KimiK25VLModel(
config: megatron.bridge.models.gpt_provider.GPTModelProvider,
pre_process: bool = True,
post_process: bool = True,
vp_stage: Optional[int] = None,
)#

Bases: megatron.core.transformer.module.MegatronModule

Kimi K2.5 Vision-Language (VL) model wrapper for Megatron.

Parameters:
  • config (GPTModelProvider) – Model provider containing configuration for language and vision modules.

  • pre_process (bool, optional) – Whether to construct the vision tower and projector. Default: True.

  • post_process (bool, optional) – Whether to apply post-processing. Default: True.

  • vp_stage (Optional[int], optional) – Pipeline stage for model parallelism. Default: None.

.. attribute:: pre_process

If True, enables vision and multimodal components.

Type:

bool

.. attribute:: post_process

If True, enables post-processing.

Type:

bool

.. attribute:: vp_stage

Pipeline stage for model parallelism.

Type:

Optional[int]

.. attribute:: vision_tower

Vision encoder (MoonViT3d vision backbone).

Type:

nn.Module

.. attribute:: mm_projector

PatchMergerMLP that projects vision features to language model space.

Type:

nn.Module

.. attribute:: language_model

The underlying Kimi K2 language model.

Type:

nn.Module

.. attribute:: get_image_features

Method to extract and project image features.

Type:

callable

Forward Inputs:
  • input_ids (torch.LongTensor, optional) – Tokenized input ids for the language model.

  • attention_mask (torch.Tensor, optional) – Attention mask for the language model.

  • position_ids (torch.LongTensor, optional) – Position ids for the language model.

  • inputs_embeds (torch.FloatTensor, optional) – Precomputed input embeddings.

  • pixel_values (torch.Tensor, optional) – Image tensor(s) for the vision tower.

  • labels (torch.Tensor, optional) – Target labels for supervised training.

  • runtime_gather_output (bool, optional) – If True, gather outputs across pipeline stages.

  • loss_mask (Tensor, optional) – Mask for loss computation.

Returns:

Model output (e.g., logits or loss, depending on mode).

Return type:

Tensor

.. note::

  • If pre_process is False, only the language model is constructed.

  • The vision tower and projector are only active if pre_process is True.

  • This class is intended for use within the Megatron-LM framework.

Initialization

set_input_tensor(input_tensor) → None#

Set model chunk input tensor.

_merge_input_ids_with_image_features(
image_features: List[torch.Tensor],
inputs_embeds: torch.Tensor,
input_ids: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
labels: Optional[torch.Tensor] = None,
target_seq_length: Optional[int] = None,
)#

Merge image features into input embeddings.

Supports two modes:

  1. Pre-expanded (PP mode): input_ids already contains N placeholder tokens per image, where N is the number of image features. Performs a simple 1:1 replacement.

  2. Dynamic expansion: input_ids contains 1 placeholder per image, which is expanded to N tokens.

Parameters:
  • image_features – List of image feature tensors, one per image

  • inputs_embeds – Text embeddings (batch_size, seq_len, embed_dim)

  • input_ids – Token IDs (batch_size, seq_len)

  • attention_mask – Attention mask (batch_size, seq_len)

  • labels – Optional labels for training

  • target_seq_length – Optional fixed output length for pipeline parallelism.
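The "dynamic expansion" mode (mode 2) can be sketched as follows. This is a simplified illustration in plain Python lists rather than torch tensors; `IMAGE_TOKEN_ID` and the embedding values are hypothetical placeholders, not values from the actual implementation.

```python
IMAGE_TOKEN_ID = -200  # hypothetical placeholder id, for illustration only

def merge_image_features(image_features, inputs_embeds, input_ids):
    """Expand each single image placeholder into that image's N feature vectors."""
    merged, feat_iter = [], iter(image_features)
    for tok, emb in zip(input_ids, inputs_embeds):
        if tok == IMAGE_TOKEN_ID:
            merged.extend(next(feat_iter))  # splice in N vectors for this image
        else:
            merged.append(emb)              # keep the text embedding as-is
    return merged

# One image whose projected features span 3 positions:
feats = [[[1.0], [2.0], [3.0]]]
ids = [5, IMAGE_TOKEN_ID, 7]
embeds = [[0.5], [0.0], [0.7]]
out = merge_image_features(feats, embeds, ids)
# out == [[0.5], [1.0], [2.0], [3.0], [0.7]]
```

In the pre-expanded PP mode the sequence length is fixed up front (see target_seq_length), so the placeholder positions are overwritten in place instead of growing the sequence.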

_extract_image_features(pixel_values, grid_thws)#

Extract and project image features.

forward(
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
pixel_values: Optional[torch.Tensor] = None,
image_grid_thw: Optional[torch.Tensor] = None,
labels: Optional[torch.Tensor] = None,
runtime_gather_output: Optional[bool] = None,
*,
loss_mask: Optional[torch.Tensor] = None,
packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
) → torch.Tensor#
Parameters:
  • input_ids – Tokenized input ids for the language model.

  • attention_mask – Attention mask for the language model.

  • position_ids – Position ids for the language model.

  • inputs_embeds – Precomputed input embeddings.

  • pixel_values – Image tensor for the vision tower.

  • image_grid_thw – Tensor of shape (num_images, 3) containing [temporal, height, width] for each image’s grid dimensions in the LLM. This corresponds to grid_thws in the HF Kimi K2.5 processor output.

  • labels – Target labels for supervised training.

  • runtime_gather_output – If True, gather outputs across pipeline stages.

  • loss_mask – Mask for loss computation.
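As a rough illustration of how image_grid_thw relates to sequence length: in similar VL processors, the number of LLM vision tokens per image is derived from the [t, h, w] grid divided by a spatial merge factor. The merge factor of 2 below is an assumption borrowed from comparable models, not a confirmed Kimi K2.5 value.

```python
# Hypothetical sketch: LLM vision-token count from an image_grid_thw row.
# The merge_size default of 2 is an assumption, not from the K2.5 source.
def vision_tokens_per_image(grid_thw, merge_size=2):
    t, h, w = grid_thw
    return (t * h * w) // (merge_size ** 2)

print(vision_tokens_per_image([1, 16, 16]))  # 64 under these assumptions
```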

.. note::

For _merge_input_ids_with_image_features, there are two modes for processing input_ids:

  1. Pre-expanded (PP mode): input_ids already contains N placeholder tokens per image, where N is the number of image features. Performs a simple 1:1 replacement.

  2. Dynamic expansion: input_ids contains 1 placeholder per image, which is expanded to N tokens.

freeze(
freeze_language_model: bool,
freeze_vision_model: bool,
freeze_vision_projection: bool,
)#

Freeze model modules.

Make specific modules non-trainable by setting requires_grad to False.

Parameters:
  • freeze_language_model (bool) – Freeze the language model module.

  • freeze_vision_model (bool) – Freeze the vision model module (patch_embed and blocks).

  • freeze_vision_projection (bool) – Freeze the vision projection module (merger).
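The mechanics of freezing reduce to setting requires_grad to False on every parameter of the selected submodules. A minimal torch-free sketch, where DummyParam/DummyModule stand in for their torch.nn equivalents:

```python
class DummyParam:
    """Stand-in for torch.nn.Parameter; only requires_grad matters here."""
    def __init__(self):
        self.requires_grad = True

class DummyModule:
    """Stand-in for an nn.Module exposing parameters()."""
    def __init__(self, n):
        self._params = [DummyParam() for _ in range(n)]
    def parameters(self):
        return self._params

def freeze(modules):
    # Make the given modules non-trainable.
    for m in modules:
        for p in m.parameters():
            p.requires_grad = False

language_model, vision_tower, mm_projector = (DummyModule(2) for _ in range(3))
# Equivalent in spirit to freeze(freeze_language_model=False,
# freeze_vision_model=True, freeze_vision_projection=True):
freeze([vision_tower, mm_projector])
assert all(p.requires_grad for p in language_model.parameters())
assert not any(p.requires_grad for p in vision_tower.parameters())
```

A common fine-tuning setup freezes the language model and vision tower while leaving only the projection trainable, so only the merger adapts to the new domain.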