nemo_automodel.components.models.kimi_k25_vl.model
KimiK25VL model with backend-aware DeepseekV3 language model.
This is a self-contained implementation that includes all necessary components:
- Configuration classes
- Vision tower (MoonViT3d with temporal dimension)
- Multi-modal projector (PatchMergerMLP)
- Language model backend (DeepseekV3)
API
Callable adapter that wraps DeepseekV3’s freqs_cis-based RoPE.
Bases: PretrainedConfig
Configuration for KimiK25VL model.
Supports both ‘kimi_k25_vl’ and ‘kimi_k25’ model types for compatibility with original checkpoints.
Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin
KimiK25VL model with backend-aware DeepseekV3 language model.
Load model from pretrained path.
Creates the model structure only. Weights are loaded by DCP, which first calls state_dict_adapter.to_hf() to obtain checkpoint-format keys (including _packed/_scale/*_shape for INT4) and then from_hf() to dequantize.
Bases: Module
Backend-aware language model wrapper using DeepseekV3 architecture.
Bases: Module
KimiK25VL multimodal backbone with a DeepseekV3 text decoder.
Pre-compute number of image tokens from grid_thws without running vision tower.
For 1 image per sample: num_tokens = (h // merge_h) * (w // merge_w)
With the default merge_kernel_size=(2,2): num_tokens = (h // 2) * (w // 2)
Parameters:
Tensor of shape (batch_size, 3) with [t, h, w] per sample
Returns: List[int]
List of expected image token counts per sample
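The formula above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name count_image_tokens is hypothetical, and it takes plain (t, h, w) triples rather than a tensor (a tensor could be converted with .tolist() first).

```python
from typing import Iterable, List, Sequence, Tuple


def count_image_tokens(
    grid_thws: Iterable[Sequence[int]],
    merge_kernel_size: Tuple[int, int] = (2, 2),
) -> List[int]:
    """Hypothetical helper: expected image token count per sample.

    Each entry of grid_thws is a (t, h, w) triple. Following the documented
    formula for one image per sample, only h and w enter the count.
    """
    merge_h, merge_w = merge_kernel_size
    counts = []
    for _t, h, w in grid_thws:
        counts.append((h // merge_h) * (w // merge_w))
    return counts


print(count_image_tokens([(1, 28, 28), (1, 14, 42)]))  # [196, 147]
```

Because the count depends only on the grid sizes, it can be computed during data loading, before any vision features exist.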
Extract and project image features.
Merge image features into input embeddings.
Supports two modes:
- Pre-expanded (PP mode): input_ids already contains N placeholder tokens per image, where N is the number of features for that image. The features replace the placeholders 1:1.
- Dynamic expansion: input_ids contains 1 placeholder per image, and each placeholder is expanded to N tokens.
Parameters:
List of image feature tensors, one per image
Text embeddings (batch_size, seq_len, embed_dim)
Token IDs (batch_size, seq_len)
Attention mask (batch_size, seq_len)
Optional labels for training
Optional fixed output length for pipeline parallelism.
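The difference between the two modes can be illustrated on token IDs alone. The sketch below is an assumption-laden illustration: expand_placeholders and is_pre_expanded are hypothetical names, the placeholder ID 99 and pad ID 0 are made up, and how the real model distinguishes the modes is not stated here. It operates on a single sequence as a Python list and ignores embeddings, masks, and labels.

```python
from typing import List, Optional, Sequence


def is_pre_expanded(
    input_ids: Sequence[int], feature_counts: Sequence[int], placeholder_id: int
) -> bool:
    """Hypothetical check: placeholders already match the feature count 1:1."""
    num_placeholders = sum(1 for tok in input_ids if tok == placeholder_id)
    return num_placeholders == sum(feature_counts)


def expand_placeholders(
    input_ids: Sequence[int],
    feature_counts: Sequence[int],
    placeholder_id: int,
    pad_id: int = 0,
    target_seq_length: Optional[int] = None,
) -> List[int]:
    """Hypothetical dynamic expansion: one placeholder per image becomes N tokens."""
    num_placeholders = sum(1 for tok in input_ids if tok == placeholder_id)
    if num_placeholders != len(feature_counts):
        raise ValueError("expected exactly one placeholder per image")

    expanded: List[int] = []
    image_idx = 0
    for tok in input_ids:
        if tok == placeholder_id:
            # Reserve one position per image feature.
            expanded.extend([placeholder_id] * feature_counts[image_idx])
            image_idx += 1
        else:
            expanded.append(tok)

    if target_seq_length is not None:
        if len(expanded) > target_seq_length:
            raise ValueError("target_seq_length is shorter than the expanded sequence")
        # Pad to a fixed length so every pipeline stage sees the same shape.
        expanded.extend([pad_id] * (target_seq_length - len(expanded)))
    return expanded


ids = [5, 99, 6, 7]  # 99 is an illustrative placeholder ID
print(expand_placeholders(ids, [4], placeholder_id=99))
# [5, 99, 99, 99, 99, 6, 7]
print(expand_placeholders(ids, [4], placeholder_id=99, target_seq_length=10))
# [5, 99, 99, 99, 99, 6, 7, 0, 0, 0]
```

After expansion the sequence satisfies is_pre_expanded, so the image features can be written into the placeholder positions one by one.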
Forward pass with optional fixed sequence length for pipeline parallelism.
Parameters:
If provided, the output after image token expansion is padded to this fixed length. Required for PP with varying image sizes. For one image per sample, it can be pre-computed as max_text_len - 1 + max_image_tokens, where max_image_tokens is the largest value of (h // 2) * (w // 2) over the images.
Bases: Module
Projects vision features to language model dimension using patch merger MLP.
Bases: Module
Learnable 2D interpolatable position embedding with fixed temporal sincos embedding.
Bases: PretrainedConfig
Configuration for MoonViT3d vision encoder with temporal support.
Bases: Module
MoonViT3d encoder with temporal support.
Bases: Module
Single encoder layer for MoonViT3d.
Bases: Module
MLP for MoonViT3d.
Bases: Module
MoonViT3d vision encoder with temporal support.
Bases: Module
Patch embedding for MoonViT3d.
Bases: Module
2D rotary position embedding repeated for temporal dimension.
Apply rotary position embedding for vision.
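As a rough illustration of the rotation step in a freqs_cis-style rotary embedding, the sketch below pairs consecutive features into complex numbers and multiplies each pair by a unit-magnitude complex factor. It is a 1D, pure-Python simplification under stated assumptions: the names apply_rope and make_freqs_cis, the interleaved pairing, and the base theta=10000 are illustrative choices, not confirmed details of this module. The vision variant would derive its angles from 2D (h, w) positions.

```python
import cmath
import math
from typing import List, Sequence


def make_freqs_cis(position: float, dim: int, theta: float = 10000.0) -> List[complex]:
    """Hypothetical helper: one unit complex rotation per feature pair."""
    return [
        cmath.exp(1j * position * theta ** (-2 * k / dim))
        for k in range(dim // 2)
    ]


def apply_rope(vec: Sequence[float], freqs_cis: Sequence[complex]) -> List[float]:
    """Rotate consecutive feature pairs (x[2k], x[2k+1]) by freqs_cis[k]."""
    if len(vec) != 2 * len(freqs_cis):
        raise ValueError("vec must have exactly two features per rotation")
    out: List[float] = []
    for k, rotation in enumerate(freqs_cis):
        z = complex(vec[2 * k], vec[2 * k + 1]) * rotation
        out.extend([z.real, z.imag])
    return out


query = [1.0, 0.0, 0.5, -0.5]
rotated = apply_rope(query, make_freqs_cis(position=3, dim=4))
# The rotation changes direction but not length.
print(math.isclose(math.hypot(*query), math.hypot(*rotated)))  # True
```

Because each factor has magnitude 1, the operation preserves vector norms and only encodes position in the phase of each pair.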
Register KimiK25VLConfig and model with transformers Auto classes.
Compute the expanded sequence length after image token insertion.
For pipeline parallelism, this can be used to pre-compute the target_seq_length parameter needed for fixed-shape outputs.
Parameters:
Original text sequence length (including 1 placeholder per image)
Tensor of shape (num_images, 3) with [t, h, w] per image
Vision tower’s patch merge kernel size, default (2, 2)
Number of images (placeholders) in the sequence
Returns: int
Expected sequence length after image features are inserted
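A minimal sketch of this computation, assuming each image contributes (h // merge_h) * (w // merge_w) tokens and replaces exactly one placeholder. The name expanded_seq_length is hypothetical, and the handling of num_images (taking the first num_images rows of grid_thws) is an assumption made for illustration.

```python
from typing import Sequence, Tuple


def expanded_seq_length(
    text_seq_length: int,
    grid_thws: Sequence[Sequence[int]],
    merge_kernel_size: Tuple[int, int] = (2, 2),
    num_images: int = 1,
) -> int:
    """Hypothetical helper: sequence length after image tokens are inserted."""
    merge_h, merge_w = merge_kernel_size
    rows = list(grid_thws)[:num_images]
    if len(rows) < num_images:
        raise ValueError("grid_thws has fewer rows than num_images")
    image_tokens = sum((h // merge_h) * (w // merge_w) for _t, h, w in rows)
    # Each image replaces its single placeholder, hence the "- num_images".
    return text_seq_length - num_images + image_tokens


print(expanded_seq_length(32, [(1, 28, 28)]))  # 227

# A batch-wide fixed length is the maximum over the samples.
batch = [(32, [(1, 28, 28)]), (40, [(1, 14, 14)])]
print(max(expanded_seq_length(n, grid) for n, grid in batch))  # 227
```

With one image per sample this reduces to text_seq_length - 1 + image_tokens, the same shape as the target_seq_length formula given for the forward pass.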
Generate 1D sinusoidal positional embedding for temporal dimension.
Generate 1D sinusoidal positional embedding from grid positions.
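For reference, a fixed 1D sinusoidal embedding can be sketched as follows. This follows the commonly used layout (all sine terms first, then all cosine terms, with base 10000); whether this module uses the same ordering and base is an assumption, and sincos_1d is a hypothetical name. For a temporal axis, the positions would typically be the frame indices 0..t-1.

```python
import math
from typing import List, Sequence


def sincos_1d(embed_dim: int, positions: Sequence[float]) -> List[List[float]]:
    """Hypothetical helper: one embed_dim-sized sin/cos row per position."""
    if embed_dim % 2 != 0:
        raise ValueError("embed_dim must be even")
    half = embed_dim // 2
    # Geometric progression of frequencies from 1 down towards 1/10000.
    omega = [1.0 / (10000 ** (i / half)) for i in range(half)]
    table = []
    for pos in positions:
        angles = [pos * w for w in omega]
        table.append([math.sin(a) for a in angles] + [math.cos(a) for a in angles])
    return table


table = sincos_1d(4, positions=range(3))  # e.g. three temporal steps
print(table[0])  # [0.0, 0.0, 1.0, 1.0]
```

The embedding has no learnable parameters, so it extends to any number of positions without retraining.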
Merge patches with temporal pooling.
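One plausible reading of this operation is sketched below: average the features over the temporal axis, then group each merge_h x merge_w block of neighbouring patches so a downstream projector can consume them together. Mean pooling and the row-major grouping order are assumptions made for illustration, and the function name is hypothetical. Nested lists stand in for tensors.

```python
from typing import List, Tuple

Vector = List[float]


def merge_patches_with_temporal_pooling(
    frames: List[List[List[Vector]]],
    merge_kernel_size: Tuple[int, int] = (2, 2),
) -> List[List[Vector]]:
    """Hypothetical sketch. frames[t][h][w] is the feature vector of one patch.

    Returns (h // merge_h) * (w // merge_w) groups, each holding
    merge_h * merge_w time-averaged feature vectors.
    """
    merge_h, merge_w = merge_kernel_size
    t, h, w = len(frames), len(frames[0]), len(frames[0][0])
    dim = len(frames[0][0][0])

    # Step 1 (assumed): mean-pool over the temporal axis.
    pooled = [
        [
            [sum(frames[k][i][j][d] for k in range(t)) / t for d in range(dim)]
            for j in range(w)
        ]
        for i in range(h)
    ]

    # Step 2: gather each merge_h x merge_w spatial block into one group.
    merged = []
    for bi in range(h // merge_h):
        for bj in range(w // merge_w):
            group = [
                pooled[bi * merge_h + di][bj * merge_w + dj]
                for di in range(merge_h)
                for dj in range(merge_w)
            ]
            merged.append(group)
    return merged


# Two frames of a 2x4 grid with 1-dimensional features.
frame0 = [[[float(i * 4 + j)] for j in range(4)] for i in range(2)]
frame1 = [[[v[0] + 2.0] for v in row] for row in frame0]
print(merge_patches_with_temporal_pooling([frame0, frame1]))
# [[[1.0], [2.0], [5.0], [6.0]], [[3.0], [4.0], [7.0], [8.0]]]
```

The number of output groups equals (h // 2) * (w // 2) for the default kernel, which matches the image token count used elsewhere on this page.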
Flash attention for vision.
SDPA attention for vision.