nemo_automodel.components.models.kimi_k25_vl.model#
KimiK25VL model with backend-aware DeepseekV3 language model.
This is a self-contained implementation that includes all necessary components:
- Configuration classes
- Vision tower (MoonViT3d with temporal dimension)
- Multi-modal projector (PatchMergerMLP)
- Language model backend (DeepseekV3)
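As a quick orientation, the sketch below wires these pieces together from Python. It is illustrative only: the import path is taken from this page and `from_config` is documented on `KimiK25VLForConditionalGeneration` below, but whether the default (None) sub-configs produce a usable model for a given checkpoint is an assumption.

```python
# Illustrative sketch only; default sub-config behavior is an assumption.
from nemo_automodel.components.models.kimi_k25_vl.model import (
    KimiK25VLConfig,
    KimiK25VLForConditionalGeneration,
)

config = KimiK25VLConfig()  # vision_config / text_config fall back to defaults
model = KimiK25VLForConditionalGeneration.from_config(config)
```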
Module Contents#
Classes#
| MoonViT3dConfig | Configuration for MoonViT3d vision encoder with temporal support. |
| KimiK25VLConfig | Configuration for KimiK25VL model. |
| Learnable2DInterpPosEmbDividedFixed | Learnable 2D interpolatable position embedding with fixed temporal sincos embedding. |
| Rope2DPosEmbRepeated | 2D rotary position embedding repeated for temporal dimension. |
| MoonViT3dMLP | MLP for MoonViT3d. |
| MoonViT3dEncoderLayer | Single encoder layer for MoonViT3d. |
| MoonViT3dEncoder | MoonViT3d encoder with temporal support. |
| MoonVision3dPatchEmbed | Patch embedding for MoonViT3d. |
| MoonViT3dPretrainedModel | MoonViT3d vision encoder with temporal support. |
| KimiK25VLMultiModalProjector | Projects vision features to language model dimension using patch merger MLP. |
| DeepSeekV3RotaryEmbeddingAdapter | Callable adapter that wraps DeepseekV3’s freqs_cis-based RoPE. |
| KimiK25VLLanguageModelBackend | Backend-aware language model wrapper using DeepseekV3 architecture. |
| KimiK25VLModel | KimiK25VL multimodal backbone with a DeepseekV3 text decoder. |
| KimiK25VLForConditionalGeneration | KimiK25VL model with backend-aware DeepseekV3 language model. |
Functions#
| get_1d_sincos_pos_embed_from_grid | Generate 1D sinusoidal positional embedding from grid positions. |
| get_1d_sincos_pos_embed | Generate 1D sinusoidal positional embedding for temporal dimension. |
| _apply_rope_vision | Apply rotary position embedding for vision. |
| vision_attention_flash | Flash attention for vision. |
| vision_attention_sdpa | SDPA attention for vision. |
| tpool_patch_merger | Merge patches with temporal pooling. |
| _register_kimi_k25_vl_with_transformers | Register KimiK25VLConfig and model with transformers Auto classes. |
| compute_expanded_seq_length | Compute the expanded sequence length after image token insertion. |
Data#
API#
- nemo_automodel.components.models.kimi_k25_vl.model.LOGGER#
‘getLogger(…)’
- class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dConfig(
- patch_size: int = 14,
- init_pos_emb_height: int = 64,
- init_pos_emb_width: int = 64,
- init_pos_emb_time: int = 4,
- pos_emb_type: str = 'divided_fixed',
- num_attention_heads: int = 16,
- num_hidden_layers: int = 27,
- hidden_size: int = 1152,
- intermediate_size: int = 4304,
- merge_kernel_size: Tuple[int, int] = (2, 2),
- video_attn_type: str = 'spatial_temporal',
- merge_type: str = 'sd2_tpool',
- **kwargs,
Bases:
transformers.configuration_utils.PretrainedConfig
Configuration for MoonViT3d vision encoder with temporal support.
Initialization
- model_type#
‘moonvit3d’
- class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLConfig(
- vision_config: Optional[Union[Dict, nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dConfig]] = None,
- text_config: Optional[Union[Dict, transformers.models.deepseek_v3.configuration_deepseek_v3.DeepseekV3Config]] = None,
- ignore_index: int = -100,
- media_placeholder_token_id: int = 163605,
- pad_token_id: int = 0,
- tie_word_embeddings: bool = False,
- mm_projector_type: str = 'patchmerger',
- mm_hidden_size: Optional[int] = None,
- projector_hidden_act: str = 'gelu',
- projector_ln_eps: float = 1e-05,
- architectures: Optional[List[str]] = None,
- **kwargs,
Bases:
transformers.configuration_utils.PretrainedConfig
Configuration for KimiK25VL model.
Supports both ‘kimi_k25_vl’ and ‘kimi_k25’ model types for compatibility with original checkpoints.
Initialization
- model_type#
‘kimi_k25’
- to_dict() Dict[str, Any]#
- nemo_automodel.components.models.kimi_k25_vl.model.get_1d_sincos_pos_embed_from_grid(
- embed_dim: int,
- pos: numpy.ndarray,
Generate 1D sinusoidal positional embedding from grid positions.
- nemo_automodel.components.models.kimi_k25_vl.model.get_1d_sincos_pos_embed(
- embed_dim: int,
- t_size: int,
- cls_token: bool = False,
Generate 1D sinusoidal positional embedding for temporal dimension.
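The two helpers above follow the standard ViT/MAE-style sincos recipe. The sketch below shows that recipe with a hypothetical helper name `sincos_1d`; the exact interleaving and dtype handling inside this module are assumptions.

```python
import numpy as np

def sincos_1d(embed_dim: int, pos: np.ndarray) -> np.ndarray:
    """Hypothetical helper: classic sin/cos embedding over 1D positions."""
    assert embed_dim % 2 == 0
    # Frequencies 1 / 10000^(2i / embed_dim), i = 0 .. embed_dim/2 - 1.
    omega = 1.0 / 10000 ** (np.arange(embed_dim // 2, dtype=np.float64) / (embed_dim / 2))
    angles = np.einsum("p,d->pd", pos.reshape(-1).astype(np.float64), omega)  # (P, D/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)           # (P, D)

# Temporal variant: positions are simply 0 .. t_size - 1.
temporal_emb = sincos_1d(64, np.arange(4))
print(temporal_emb.shape)  # (4, 64)
```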
- nemo_automodel.components.models.kimi_k25_vl.model._apply_rope_vision(
- xq: torch.Tensor,
- xk: torch.Tensor,
- freqs_cis: torch.Tensor,
Apply rotary position embedding for vision.
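`freqs_cis` is a complex-valued tensor of rotation factors. A common way to apply it to query/key tensors is the complex-multiplication form sketched below; the exact layout and broadcasting used by `_apply_rope_vision` are assumptions.

```python
import torch

def apply_rope_complex(xq: torch.Tensor, xk: torch.Tensor, freqs_cis: torch.Tensor):
    # View the last dimension as complex pairs, rotate by freqs_cis, view back to real.
    xq_c = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_c = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # freqs_cis must broadcast against the (..., head_dim // 2) complex entries.
    xq_out = torch.view_as_real(xq_c * freqs_cis).flatten(-2)
    xk_out = torch.view_as_real(xk_c * freqs_cis).flatten(-2)
    return xq_out.type_as(xq), xk_out.type_as(xk)
```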
- nemo_automodel.components.models.kimi_k25_vl.model.vision_attention_flash(
- q,
- k,
- v,
- q_cu_seqlens,
- k_cu_seqlens,
- max_seqlen_q=None,
- max_seqlen_k=None,
Flash attention for vision.
- nemo_automodel.components.models.kimi_k25_vl.model.vision_attention_sdpa(q, k, v, q_cu_seqlens, k_cu_seqlens, **kwargs)#
SDPA attention for vision.
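Both attention helpers take cumulative sequence lengths (`q_cu_seqlens`, `k_cu_seqlens`) so variable-length vision sequences can be packed into one tensor. As a rough illustration of how an SDPA path can honor `cu_seqlens` by attending within each packed segment (a sketch only; the module's actual masking strategy may differ, e.g. a block-diagonal mask instead of a Python loop):

```python
import torch
import torch.nn.functional as F

def sdpa_packed(q, k, v, cu_seqlens):
    # q, k, v: (total_tokens, num_heads, head_dim); cu_seqlens: (num_seqs + 1,) int tensor.
    outputs = []
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        qi = q[start:end].transpose(0, 1)  # (heads, seq_i, dim)
        ki = k[start:end].transpose(0, 1)
        vi = v[start:end].transpose(0, 1)
        outputs.append(F.scaled_dot_product_attention(qi, ki, vi).transpose(0, 1))
    return torch.cat(outputs, dim=0)       # (total_tokens, num_heads, head_dim)
```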
- class nemo_automodel.components.models.kimi_k25_vl.model.Learnable2DInterpPosEmbDividedFixed(
- height: int,
- width: int,
- num_frames: int,
- dim: int,
- interpolation_mode: str = 'bicubic',
Bases:
torch.nn.Module
Learnable 2D interpolatable position embedding with fixed temporal sincos embedding.
Initialization
- forward(x: torch.Tensor, grid_thws: torch.Tensor) torch.Tensor#
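The spatial table is interpolated (default `bicubic`) to each image's patch grid, while the temporal axis uses the fixed sincos embedding above. Below is a sketch of the spatial interpolation step only, with an assumed (H, W, D) storage layout; the module's real buffer layout and temporal handling may differ.

```python
import torch
import torch.nn.functional as F

def interp_spatial(pos_emb: torch.Tensor, h: int, w: int, mode: str = "bicubic") -> torch.Tensor:
    # pos_emb: assumed (H, W, D) learnable table; resize to the image's (h, w) patch grid.
    grid = pos_emb.permute(2, 0, 1).unsqueeze(0)                    # (1, D, H, W)
    grid = F.interpolate(grid, size=(h, w), mode=mode, align_corners=False)
    return grid.squeeze(0).permute(1, 2, 0).reshape(h * w, -1)      # (h * w, D)

pos_table = torch.zeros(64, 64, 1152)            # init_pos_emb_height/width x hidden_size
print(interp_spatial(pos_table, 32, 48).shape)   # torch.Size([1536, 1152])
```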
- class nemo_automodel.components.models.kimi_k25_vl.model.Rope2DPosEmbRepeated(
- dim: int,
- max_height: int,
- max_width: int,
- theta_base: float = 10000,
Bases:
torch.nn.Module
2D rotary position embedding repeated for temporal dimension.
Initialization
- _precompute_freqs_cis(device: torch.device) torch.Tensor#
- get_freqs_cis(
- grid_thws: torch.Tensor,
- device: torch.device,
- class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dMLP(dims: List[int], activation, bias: bool = True)#
Bases:
torch.nn.Module
MLP for MoonViT3d.
Initialization
- forward(x: torch.Tensor) torch.Tensor#
- class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dEncoderLayer(
- num_heads: int,
- hidden_dim: int,
- mlp_dim: int,
- *,
- activation=F.gelu,
- attn_bias: bool = False,
- attn_implementation: str = 'flash_attention_2',
Bases:
torch.nn.Module
Single encoder layer for MoonViT3d.
Initialization
- forward(
- hidden_states: torch.Tensor,
- cu_seqlens: torch.Tensor,
- max_seqlen: int,
- rope_freqs_cis: torch.Tensor,
- class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dEncoder(hidden_dim: int, num_layers: int, block_cfg: dict)#
Bases:
torch.nn.Module
MoonViT3d encoder with temporal support.
Initialization
- forward(
- hidden_states: torch.Tensor,
- grid_thws: torch.Tensor,
- class nemo_automodel.components.models.kimi_k25_vl.model.MoonVision3dPatchEmbed(
- out_dim: int,
- in_dim: int = 3,
- patch_size: int = 14,
- pos_emb_height: int = 64,
- pos_emb_width: int = 64,
- pos_emb_time: int = 4,
Bases:
torch.nn.Module
Patch embedding for MoonViT3d.
Initialization
- forward(x: torch.Tensor, grid_thws: torch.Tensor) torch.Tensor#
- nemo_automodel.components.models.kimi_k25_vl.model.tpool_patch_merger(
- x: torch.Tensor,
- grid_thws: torch.Tensor,
- merge_kernel_size: List[int] = [2, 2],
Merge patches with temporal pooling.
- class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dPretrainedModel(config)#
Bases:
torch.nn.Module
MoonViT3d vision encoder with temporal support.
Initialization
- property dtype#
- forward(
- pixel_values: torch.Tensor,
- grid_thws: torch.Tensor,
- class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLMultiModalProjector(config)#
Bases:
torch.nn.Module
Projects vision features to language model dimension using patch merger MLP.
Initialization
- forward(
- image_features: List[torch.Tensor],
- class nemo_automodel.components.models.kimi_k25_vl.model.DeepSeekV3RotaryEmbeddingAdapter(
- parent_module: torch.nn.Module,
- rope_fusion: bool = False,
Callable adapter that wraps DeepseekV3’s freqs_cis-based RoPE.
Initialization
- property freqs_cis#
- __call__(
- hidden_states: torch.Tensor,
- position_ids: torch.Tensor,
- class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLLanguageModelBackend(
- config,
- backend: nemo_automodel.components.models.common.BackendConfig,
- *,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
Bases:
torch.nn.Module
Backend-aware language model wrapper using DeepseekV3 architecture.
Initialization
- get_input_embeddings()#
- set_input_embeddings(value)#
- forward(
- input_ids=None,
- *,
- inputs_embeds=None,
- attention_mask=None,
- position_ids=None,
- padding_mask=None,
- **kwargs,
- init_weights(buffer_device=None)#
- property embed_tokens#
- property layers#
- property norm#
- class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLModel(
- config,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
Bases:
torch.nn.Module
KimiK25VL multimodal backbone with a DeepseekV3 text decoder.
Initialization
- property layers#
- property embed_tokens#
- property norm#
- _compute_num_image_tokens_from_grid(
- grid_thws: torch.Tensor,
Pre-compute the number of image tokens from grid_thws without running the vision tower.
For 1 image per sample: num_tokens = (h // merge_h) * (w // merge_w). With the default merge_kernel_size=(2, 2), this is num_tokens = (h // 2) * (w // 2).
- Parameters:
grid_thws – Tensor of shape (batch_size, 3) with [t, h, w] per sample
- Returns:
List of expected image token counts per sample
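A worked example of the count using the formula above (illustrative grid values):

```python
import torch

# grid_thws rows are [t, h, w] in patch units; with merge_kernel_size=(2, 2)
# each sample contributes (h // 2) * (w // 2) image tokens.
grid_thws = torch.tensor([[1, 28, 28],    # (28 // 2) * (28 // 2) = 196 tokens
                          [1, 14, 28]])   # (14 // 2) * (28 // 2) = 98 tokens
expected = [(h // 2) * (w // 2) for _, h, w in grid_thws.tolist()]
print(expected)  # [196, 98]
```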
- _merge_input_ids_with_image_features(
- image_features: List[torch.Tensor],
- inputs_embeds: torch.Tensor,
- input_ids: torch.Tensor,
- attention_mask: torch.Tensor,
- labels: Optional[torch.Tensor] = None,
- target_seq_length: Optional[int] = None,
Merge image features into input embeddings.
Supports two modes:
- Pre-expanded (PP mode): input_ids already contains N placeholder tokens per image, where N is the number of image features. Performs a simple 1:1 replacement.
- Dynamic expansion: input_ids contains 1 placeholder per image, which is expanded to N tokens.
- Parameters:
image_features – List of image feature tensors, one per image
inputs_embeds – Text embeddings (batch_size, seq_len, embed_dim)
input_ids – Token IDs (batch_size, seq_len)
attention_mask – Attention mask (batch_size, seq_len)
labels – Optional labels for training
target_seq_length – Optional fixed output length for pipeline parallelism.
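A minimal sketch of the pre-expanded (1:1 replacement) mode, assuming the placeholder id is `media_placeholder_token_id` and that each image already occupies exactly as many placeholder positions as it has features. Names and shapes here are illustrative, not the module's exact implementation (which also handles labels, attention masks, and dynamic expansion).

```python
import torch

def replace_placeholders(inputs_embeds, input_ids, image_features, placeholder_id):
    # inputs_embeds: (B, S, D); input_ids: (B, S);
    # image_features: (num_image_tokens, D), concatenated across images in order.
    mask = input_ids == placeholder_id           # (B, S) positions to overwrite
    out = inputs_embeds.clone()
    out[mask] = image_features.to(out.dtype)     # 1:1 replacement in row-major order
    return out
```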
- _extract_image_features(pixel_values, grid_thws)#
Extract and project image features.
- forward(
- input_ids=None,
- attention_mask=None,
- position_ids=None,
- inputs_embeds=None,
- pixel_values=None,
- grid_thws=None,
- labels=None,
- padding_mask=None,
- target_seq_length=None,
- **kwargs,
Forward pass with optional fixed sequence length for pipeline parallelism.
- Parameters:
target_seq_length – If provided, the output after image token expansion is padded to this fixed length. Required for PP with varying image sizes. Can be pre-computed as max_text_len - 1 + max_image_tokens, where max_image_tokens = (h // 2) * (w // 2) for each image.
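Using the relation quoted above, target_seq_length can be pre-computed as in this sketch (max_text_len is the per-batch maximum text length including one placeholder per image; the numbers mirror the compute_expanded_seq_length example below):

```python
# Longest text sequence is 82 tokens (with 1 image placeholder);
# largest image grid is [1, 28, 28] patches.
max_text_len = 82
max_image_tokens = (28 // 2) * (28 // 2)        # 196
target_seq_length = max_text_len - 1 + max_image_tokens
print(target_seq_length)                        # 277
```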
- class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLForConditionalGeneration(
- config,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs,
Bases:
nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin
KimiK25VL model with backend-aware DeepseekV3 language model.
Initialization
- config_class#
None
- base_model_prefix#
‘model’
- main_input_name#
‘pixel_values’
- _no_split_modules#
[‘MoonViT3dEncoderLayer’]
- supports_gradient_checkpointing#
True
- classmethod from_config(
- config,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs,
- classmethod from_pretrained(
- pretrained_model_name_or_path: str,
- *model_args,
- **kwargs,
Load model from pretrained path.
Creates the model structure. Weights are loaded by DCP, which calls the state_dict_adapter.to_hf() to get checkpoint-format keys (including *_packed/*_scale/*_shape for INT4), then from_hf() to dequantize.
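A hedged usage sketch (the checkpoint path is a placeholder; any extra keyword arguments depend on your setup, and actual weight loading happens through DCP as described above):

```python
from nemo_automodel.components.models.kimi_k25_vl.model import KimiK25VLForConditionalGeneration

# Placeholder path; replace with a real KimiK25VL checkpoint directory.
model = KimiK25VLForConditionalGeneration.from_pretrained("/path/to/kimi-k25-vl-checkpoint")
```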
- property dtype#
- get_input_embeddings()#
- set_input_embeddings(value)#
- get_output_embeddings()#
- set_output_embeddings(new_embeddings)#
- property lm_head#
- forward(
- input_ids=None,
- attention_mask=None,
- position_ids=None,
- past_key_values=None,
- inputs_embeds=None,
- labels=None,
- use_cache=None,
- output_attentions=None,
- output_hidden_states=None,
- return_dict=None,
- pixel_values=None,
- grid_thws=None,
- padding_mask=None,
- target_seq_length=None,
- **kwargs,
- initialize_weights(buffer_device=None, dtype=torch.bfloat16)#
- nemo_automodel.components.models.kimi_k25_vl.model.ModelClass#
None
- nemo_automodel.components.models.kimi_k25_vl.model._register_kimi_k25_vl_with_transformers()#
Register KimiK25VLConfig and model with transformers Auto classes.
- nemo_automodel.components.models.kimi_k25_vl.model.compute_expanded_seq_length(
- text_seq_length: int,
- grid_thws: torch.Tensor,
- merge_kernel_size: Tuple[int, int] = (2, 2),
- num_images: int = 1,
Compute the expanded sequence length after image token insertion.
For pipeline parallelism, this can be used to pre-compute the target_seq_length parameter needed for fixed-shape outputs.
- Parameters:
text_seq_length – Original text sequence length (including 1 placeholder per image)
grid_thws – Tensor of shape (num_images, 3) with [t, h, w] per image
merge_kernel_size – Vision tower’s patch merge kernel size, default (2, 2)
num_images – Number of images (placeholders) in the sequence
- Returns:
Expected sequence length after image features are inserted
Example
For 1 image per sample with grid_thws = [[1, 28, 28]]:
num_image_tokens = (28 // 2) * (28 // 2) = 196
expanded_length = text_seq_length - 1 + 196
>>> grid_thws = torch.tensor([[1, 28, 28]])
>>> compute_expanded_seq_length(82, grid_thws)
277  # 82 - 1 + 196