bridge.models.kimi_vl.modeling_kimi_k25_vl#
Module Contents#
Classes#
KimiK25VLModel – Kimi K2.5 Vision-Language (VL) model wrapper for Megatron.
Data#
API#
- bridge.models.kimi_vl.modeling_kimi_k25_vl.logger#
'getLogger(...)'
- class bridge.models.kimi_vl.modeling_kimi_k25_vl.KimiK25VLModel(
- config: megatron.bridge.models.gpt_provider.GPTModelProvider,
- pre_process: bool = True,
- post_process: bool = True,
- vp_stage: Optional[int] = None,
Bases:
megatron.core.transformer.module.MegatronModule

Kimi K2.5 Vision-Language (VL) model wrapper for Megatron.
- Parameters:
config (GPTModelProvider) – Model provider containing configuration for language and vision modules.
pre_process (bool, optional) – Whether to construct the vision tower and projector. Default: True.
post_process (bool, optional) – Whether to apply post-processing. Default: True.
vp_stage (Optional[int], optional) – Pipeline stage for model parallelism. Default: None.
.. attribute:: pre_process
If True, enables vision and multimodal components.
- Type:
bool
.. attribute:: post_process
If True, enables post-processing.
- Type:
bool
.. attribute:: vp_stage
Pipeline stage for model parallelism.
- Type:
Optional[int]
.. attribute:: vision_tower
Vision encoder (MoonViT3d vision backbone).
- Type:
nn.Module
.. attribute:: mm_projector
PatchMergerMLP that projects vision features to language model space.
- Type:
nn.Module
.. attribute:: language_model
The underlying Kimi K2 language model.
- Type:
nn.Module
.. attribute:: get_image_features
Method to extract and project image features.
- Type:
callable
Forward Inputs:
input_ids (torch.LongTensor, optional): Tokenized input ids for the language model.
attention_mask (torch.Tensor, optional): Attention mask for the language model.
position_ids (torch.LongTensor, optional): Position ids for the language model.
inputs_embeds (torch.FloatTensor, optional): Precomputed input embeddings.
pixel_values (torch.Tensor, optional): Image tensor(s) for the vision tower.
labels (torch.Tensor, optional): Target labels for supervised training.
runtime_gather_output (bool, optional): If True, gather outputs across pipeline stages.
loss_mask (Tensor, optional): Mask for loss computation.
- Returns:
Model output (e.g., logits or loss, depending on mode).
- Return type:
Tensor
.. note::
If pre_process is False, only the language model is constructed. The vision tower and projector are only active if pre_process is True. This class is intended for use within the Megatron-LM framework.
Initialization
- set_input_tensor(input_tensor) None#
Set model chunk input tensor.
- _merge_input_ids_with_image_features(
- image_features: List[torch.Tensor],
- inputs_embeds: torch.Tensor,
- input_ids: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- labels: Optional[torch.Tensor] = None,
- target_seq_length: Optional[int] = None,
Merge image features into input embeddings.
Supports two modes:
Pre-expanded (PP mode): input_ids already has N placeholder tokens per image, where N = number of image features. Does simple 1:1 replacement.
Dynamic expansion: input_ids has 1 placeholder per image, expands to N tokens.
- Parameters:
image_features – List of image feature tensors, one per image.
inputs_embeds – Text embeddings (batch_size, seq_len, embed_dim).
input_ids – Token IDs (batch_size, seq_len).
attention_mask – Attention mask (batch_size, seq_len).
labels – Optional labels for training.
target_seq_length – Optional fixed output length for pipeline parallelism.
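The two merge modes described above can be sketched in plain Python, using lists in place of tensors. The placeholder id, the feature values, and the helper name below are illustrative assumptions, not the actual Megatron implementation.

```python
IMAGE_TOKEN = -200  # hypothetical image-placeholder token id


def merge_image_features(input_ids, text_embeds, image_features):
    """Replace image placeholder tokens with image features.

    Pre-expanded (PP) mode: the number of placeholders already equals the
    total number of image features, so replacement is a simple 1:1 walk.
    Dynamic expansion: each single placeholder expands to N feature slots.
    """
    n_placeholders = sum(1 for t in input_ids if t == IMAGE_TOKEN)
    n_features = sum(len(f) for f in image_features)

    if n_placeholders == n_features:
        # Pre-expanded mode: substitute features for placeholders 1:1.
        flat = iter(v for feats in image_features for v in feats)
        return [next(flat) if t == IMAGE_TOKEN else e
                for t, e in zip(input_ids, text_embeds)]

    # Dynamic expansion: one placeholder per image grows to len(feats) slots,
    # so the output sequence is longer than the input.
    out, img_idx = [], 0
    for t, e in zip(input_ids, text_embeds):
        if t == IMAGE_TOKEN:
            out.extend(image_features[img_idx])
            img_idx += 1
        else:
            out.append(e)
    return out
```

For example, with one image contributing two features, `[1, IMAGE_TOKEN, 2]` expands dynamically from length 3 to length 4, whereas a pre-expanded sequence with two placeholders keeps its length; the real method additionally extends attention_mask and labels, and pads to target_seq_length in pipeline-parallel mode.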
- _extract_image_features(pixel_values, grid_thws)#
Extract and project image features.
- forward(
- input_ids: torch.LongTensor = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- pixel_values: Optional[torch.Tensor] = None,
- image_grid_thw: Optional[torch.Tensor] = None,
- labels: Optional[torch.Tensor] = None,
- runtime_gather_output: Optional[bool] = None,
- *,
- loss_mask: Optional[torch.Tensor] = None,
- packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
- Parameters:
input_ids – Tokenized input ids for the language model.
attention_mask – Attention mask for the language model.
position_ids – Position ids for the language model.
inputs_embeds – Precomputed input embeddings.
pixel_values – Image tensor for the vision tower.
image_grid_thw – Tensor of shape (num_images, 3) containing [temporal, height, width] for each image's grid dimensions in the LLM. This corresponds to grid_thws in the HF Kimi K2.5 processor output.
labels – Target labels for supervised training.
runtime_gather_output – If True, gather outputs across pipeline stages.
loss_mask – Mask for loss computation.
.. note::
For _merge_input_ids_with_image_features, there are two modes for processing input_ids:
Pre-expanded (PP mode): input_ids already has N placeholder tokens per image, where N = number of image features. Does simple 1:1 replacement.
Dynamic expansion: input_ids has 1 placeholder per image, expands to N tokens.
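As a rough sketch, the number N of feature tokens an image contributes can be derived from its image_grid_thw row. The spatial merge factor of 2 (the PatchMergerMLP merging 2x2 patch groups) is an assumption here, not something this module's docs specify.

```python
def num_image_tokens(t, h, w, merge_size=2):
    """Tokens contributed by one image with grid dims [temporal, height, width].

    Assumes the projector merges merge_size x merge_size spatial patches
    into one language-model token (hypothetical merge factor).
    """
    return t * (h // merge_size) * (w // merge_size)
```

For instance, a single-frame image with a 24x32 patch grid would yield 1 * 12 * 16 = 192 tokens under this assumption; in dynamic-expansion mode that is how many slots one placeholder expands into.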
- freeze(
- freeze_language_model: bool,
- freeze_vision_model: bool,
- freeze_vision_projection: bool,
Freeze model modules.
Make specific modules non-trainable by setting requires_grad to False.
- Parameters:
freeze_language_model (bool) – Freeze the language model module.
freeze_vision_model (bool) – Freeze the vision model module (patch_embed and blocks).
freeze_vision_projection (bool) – Freeze the vision projection module (merger).