nemo_automodel.components.models.kimi_k25_vl.model#

KimiK25VL model with backend-aware DeepseekV3 language model.

This is a self-contained implementation that includes all necessary components:

  • Configuration classes

  • Vision tower (MoonViT3d with temporal dimension)

  • Multi-modal projector (PatchMergerMLP)

  • Language model backend (DeepseekV3)

Module Contents#

Classes#

MoonViT3dConfig

Configuration for MoonViT3d vision encoder with temporal support.

KimiK25VLConfig

Configuration for KimiK25VL model.

Learnable2DInterpPosEmbDividedFixed

Learnable 2D interpolatable position embedding with fixed temporal sincos embedding.

Rope2DPosEmbRepeated

2D rotary position embedding repeated for temporal dimension.

MoonViT3dMLP

MLP for MoonViT3d.

MoonViT3dEncoderLayer

Single encoder layer for MoonViT3d.

MoonViT3dEncoder

MoonViT3d encoder with temporal support.

MoonVision3dPatchEmbed

Patch embedding for MoonViT3d.

MoonViT3dPretrainedModel

MoonViT3d vision encoder with temporal support.

KimiK25VLMultiModalProjector

Projects vision features to language model dimension using patch merger MLP.

DeepSeekV3RotaryEmbeddingAdapter

Callable adapter that wraps DeepseekV3’s freqs_cis-based RoPE.

KimiK25VLLanguageModelBackend

Backend-aware language model wrapper using DeepseekV3 architecture.

KimiK25VLModel

KimiK25VL multimodal backbone with a DeepseekV3 text decoder.

KimiK25VLForConditionalGeneration

KimiK25VL model with backend-aware DeepseekV3 language model.

Functions#

get_1d_sincos_pos_embed_from_grid

Generate 1D sinusoidal positional embedding from grid positions.

get_1d_sincos_pos_embed

Generate 1D sinusoidal positional embedding for temporal dimension.

_apply_rope_vision

Apply rotary position embedding for vision.

vision_attention_flash

Flash attention for vision.

vision_attention_sdpa

SDPA attention for vision.

tpool_patch_merger

Merge patches with temporal pooling.

_register_kimi_k25_vl_with_transformers

Register KimiK25VLConfig and model with transformers Auto classes.

compute_expanded_seq_length

Compute the expanded sequence length after image token insertion.

Data#

API#

nemo_automodel.components.models.kimi_k25_vl.model.LOGGER#

‘getLogger(…)’

class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dConfig(
patch_size: int = 14,
init_pos_emb_height: int = 64,
init_pos_emb_width: int = 64,
init_pos_emb_time: int = 4,
pos_emb_type: str = 'divided_fixed',
num_attention_heads: int = 16,
num_hidden_layers: int = 27,
hidden_size: int = 1152,
intermediate_size: int = 4304,
merge_kernel_size: Tuple[int, int] = (2, 2),
video_attn_type: str = 'spatial_temporal',
merge_type: str = 'sd2_tpool',
**kwargs,
)#

Bases: transformers.configuration_utils.PretrainedConfig

Configuration for MoonViT3d vision encoder with temporal support.

Initialization

model_type#

‘moonvit3d’

class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLConfig(
vision_config: Optional[Union[Dict, nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dConfig]] = None,
text_config: Optional[Union[Dict, transformers.models.deepseek_v3.configuration_deepseek_v3.DeepseekV3Config]] = None,
ignore_index: int = -100,
media_placeholder_token_id: int = 163605,
pad_token_id: int = 0,
tie_word_embeddings: bool = False,
mm_projector_type: str = 'patchmerger',
mm_hidden_size: Optional[int] = None,
projector_hidden_act: str = 'gelu',
projector_ln_eps: float = 1e-05,
architectures: Optional[List[str]] = None,
**kwargs,
)#

Bases: transformers.configuration_utils.PretrainedConfig

Configuration for KimiK25VL model.

Supports both ‘kimi_k25_vl’ and ‘kimi_k25’ model types for compatibility with original checkpoints.

Initialization

model_type#

‘kimi_k25’

to_dict() Dict[str, Any]#
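
A minimal construction sketch (values are illustrative; any field not shown falls back to the documented defaults). The vision config may be passed either as a MoonViT3dConfig instance or as a plain dict:

```python
from nemo_automodel.components.models.kimi_k25_vl.model import KimiK25VLConfig

# Illustrative values only; vision_config is accepted as a dict per the signature above.
cfg = KimiK25VLConfig(
    vision_config={"patch_size": 14, "hidden_size": 1152, "num_hidden_layers": 27},
    media_placeholder_token_id=163605,
)
print(cfg.model_type)         # 'kimi_k25'
print(sorted(cfg.to_dict()))  # serialized config keys
```
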
nemo_automodel.components.models.kimi_k25_vl.model.get_1d_sincos_pos_embed_from_grid(
embed_dim: int,
pos: numpy.ndarray,
) numpy.ndarray#

Generate 1D sinusoidal positional embedding from grid positions.

nemo_automodel.components.models.kimi_k25_vl.model.get_1d_sincos_pos_embed(
embed_dim: int,
t_size: int,
cls_token: bool = False,
) numpy.ndarray#

Generate 1D sinusoidal positional embedding for temporal dimension.

nemo_automodel.components.models.kimi_k25_vl.model._apply_rope_vision(
xq: torch.Tensor,
xk: torch.Tensor,
freqs_cis: torch.Tensor,
) Tuple[torch.Tensor, torch.Tensor]#

Apply rotary position embedding for vision.

nemo_automodel.components.models.kimi_k25_vl.model.vision_attention_flash(
q,
k,
v,
q_cu_seqlens,
k_cu_seqlens,
max_seqlen_q=None,
max_seqlen_k=None,
)#

Flash attention for vision.

nemo_automodel.components.models.kimi_k25_vl.model.vision_attention_sdpa(q, k, v, q_cu_seqlens, k_cu_seqlens, **kwargs)#

SDPA attention for vision.

class nemo_automodel.components.models.kimi_k25_vl.model.Learnable2DInterpPosEmbDividedFixed(
height: int,
width: int,
num_frames: int,
dim: int,
interpolation_mode: str = 'bicubic',
)#

Bases: torch.nn.Module

Learnable 2D interpolatable position embedding with fixed temporal sincos embedding.

Initialization

forward(x: torch.Tensor, grid_thws: torch.Tensor) torch.Tensor#
class nemo_automodel.components.models.kimi_k25_vl.model.Rope2DPosEmbRepeated(
dim: int,
max_height: int,
max_width: int,
theta_base: float = 10000,
)#

Bases: torch.nn.Module

2D rotary position embedding repeated for temporal dimension.

Initialization

_precompute_freqs_cis(device: torch.device) torch.Tensor#
get_freqs_cis(
grid_thws: torch.Tensor,
device: torch.device,
) torch.Tensor#
class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dMLP(dims: List[int], activation, bias: bool = True)#

Bases: torch.nn.Module

MLP for MoonViT3d.

Initialization

forward(x: torch.Tensor) torch.Tensor#
class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dEncoderLayer(
num_heads: int,
hidden_dim: int,
mlp_dim: int,
*,
activation=F.gelu,
attn_bias: bool = False,
attn_implementation: str = 'flash_attention_2',
)#

Bases: torch.nn.Module

Single encoder layer for MoonViT3d.

Initialization

forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
max_seqlen: int,
rope_freqs_cis: torch.Tensor,
) torch.Tensor#
class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dEncoder(hidden_dim: int, num_layers: int, block_cfg: dict)#

Bases: torch.nn.Module

MoonViT3d encoder with temporal support.

Initialization

forward(
hidden_states: torch.Tensor,
grid_thws: torch.Tensor,
) torch.Tensor#
class nemo_automodel.components.models.kimi_k25_vl.model.MoonVision3dPatchEmbed(
out_dim: int,
in_dim: int = 3,
patch_size: int = 14,
pos_emb_height: int = 64,
pos_emb_width: int = 64,
pos_emb_time: int = 4,
)#

Bases: torch.nn.Module

Patch embedding for MoonViT3d.

Initialization

forward(x: torch.Tensor, grid_thws: torch.Tensor) torch.Tensor#
nemo_automodel.components.models.kimi_k25_vl.model.tpool_patch_merger(
x: torch.Tensor,
grid_thws: torch.Tensor,
merge_kernel_size: List[int] = [2, 2],
) List[torch.Tensor]#

Merge patches with temporal pooling.

class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dPretrainedModel(config)#

Bases: torch.nn.Module

MoonViT3d vision encoder with temporal support.

Initialization

property dtype#
forward(
pixel_values: torch.Tensor,
grid_thws: torch.Tensor,
) List[torch.Tensor]#
class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLMultiModalProjector(config)#

Bases: torch.nn.Module

Projects vision features to language model dimension using patch merger MLP.

Initialization

forward(
image_features: List[torch.Tensor],
) List[torch.Tensor]#
class nemo_automodel.components.models.kimi_k25_vl.model.DeepSeekV3RotaryEmbeddingAdapter(
parent_module: torch.nn.Module,
rope_fusion: bool = False,
)#

Callable adapter that wraps DeepseekV3’s freqs_cis-based RoPE.

Initialization

property freqs_cis#
__call__(
hidden_states: torch.Tensor,
position_ids: torch.Tensor,
) torch.Tensor#
class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLLanguageModelBackend(
config,
backend: nemo_automodel.components.models.common.BackendConfig,
*,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
)#

Bases: torch.nn.Module

Backend-aware language model wrapper using DeepseekV3 architecture.

Initialization

get_input_embeddings()#
set_input_embeddings(value)#
forward(
input_ids=None,
*,
inputs_embeds=None,
attention_mask=None,
position_ids=None,
padding_mask=None,
**kwargs,
)#
init_weights(buffer_device=None)#
property embed_tokens#
property layers#
property norm#
class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLModel(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
)#

Bases: torch.nn.Module

KimiK25VL multimodal backbone with a DeepseekV3 text decoder.

Initialization

property layers#
property embed_tokens#
property norm#
_compute_num_image_tokens_from_grid(
grid_thws: torch.Tensor,
) List[int]#

Pre-compute number of image tokens from grid_thws without running vision tower.

For 1 image per sample: num_tokens = (h // merge_h) * (w // merge_w). With the default merge_kernel_size=(2, 2) this becomes num_tokens = (h // 2) * (w // 2); see the sketch below.

Parameters:

grid_thws – Tensor of shape (batch_size, 3) with [t, h, w] per sample

Returns:

List of expected image token counts per sample
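
A short worked example of this arithmetic (illustrative, not the module's implementation), assuming the default merge_kernel_size=(2, 2):

```python
import torch

# grid_thws holds [t, h, w] per sample; with merge_kernel_size=(2, 2) each sample
# contributes (h // 2) * (w // 2) image tokens.
grid_thws = torch.tensor([[1, 28, 28], [1, 56, 42]])
merge_h, merge_w = 2, 2
num_tokens = [(h // merge_h) * (w // merge_w) for _, h, w in grid_thws.tolist()]
print(num_tokens)  # [196, 588]
```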

_merge_input_ids_with_image_features(
image_features: List[torch.Tensor],
inputs_embeds: torch.Tensor,
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
labels: Optional[torch.Tensor] = None,
target_seq_length: Optional[int] = None,
)#

Merge image features into input embeddings.

Supports two modes:

  1. Pre-expanded (PP mode): input_ids already contains N placeholder tokens per image, where N is the number of image features; a simple 1:1 replacement is performed.

  2. Dynamic expansion: input_ids contains a single placeholder per image, which is expanded to N tokens (see the sketch after the parameter list).

Parameters:
  • image_features – List of image feature tensors, one per image

  • inputs_embeds – Text embeddings (batch_size, seq_len, embed_dim)

  • input_ids – Token IDs (batch_size, seq_len)

  • attention_mask – Attention mask (batch_size, seq_len)

  • labels – Optional labels for training

  • target_seq_length – Optional fixed output length for pipeline parallelism.
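
A conceptual sketch of the dynamic-expansion mode (toy shapes, not the module's implementation): the single placeholder position is replaced by the N projected image features, so the sequence grows by N - 1.

```python
import torch

placeholder_id = 163605             # media_placeholder_token_id from KimiK25VLConfig
input_ids = torch.tensor([1, 5, placeholder_id, 7, 2])
inputs_embeds = torch.randn(5, 8)   # (seq_len, embed_dim); embed_dim=8 for illustration
image_features = torch.randn(4, 8)  # N=4 projected image tokens for one image

pos = int((input_ids == placeholder_id).nonzero()[0])
merged = torch.cat([inputs_embeds[:pos], image_features, inputs_embeds[pos + 1:]], dim=0)
print(merged.shape)                 # torch.Size([8, 8]) -> 5 - 1 + 4 tokens
```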

_extract_image_features(pixel_values, grid_thws)#

Extract and project image features.

forward(
input_ids=None,
attention_mask=None,
position_ids=None,
inputs_embeds=None,
pixel_values=None,
grid_thws=None,
labels=None,
padding_mask=None,
target_seq_length=None,
**kwargs,
)#

Forward pass with optional fixed sequence length for pipeline parallelism.

Parameters:

target_seq_length – If provided, the output after image token expansion is padded to this fixed length. Required for PP with varying image sizes. It can be pre-computed as max_text_len - 1 + max_image_tokens, where max_image_tokens = (h // 2) * (w // 2) for each image (see the sketch below).
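
One way to pre-compute this value for a batch is sketched below, using compute_expanded_seq_length from this module (batch contents are illustrative):

```python
import torch

from nemo_automodel.components.models.kimi_k25_vl.model import compute_expanded_seq_length

# One image per sample; take the maximum expanded length across the batch so every
# pipeline stage sees the same fixed shape.
samples = [
    (82, torch.tensor([[1, 28, 28]])),  # (text_seq_length, grid_thws)
    (64, torch.tensor([[1, 56, 42]])),
]
target_seq_length = max(compute_expanded_seq_length(t, g) for t, g in samples)
print(target_seq_length)  # max(82 - 1 + 196, 64 - 1 + 588) = 651
```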

class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLForConditionalGeneration(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin

KimiK25VL model with backend-aware DeepseekV3 language model.

Initialization

config_class#

None

base_model_prefix#

‘model’

main_input_name#

‘pixel_values’

_no_split_modules#

[‘MoonViT3dEncoderLayer’]

supports_gradient_checkpointing#

True

classmethod from_config(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#
classmethod from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
**kwargs,
)#

Load model from pretrained path.

Creates the model structure only. Weights are loaded by DCP, which calls state_dict_adapter.to_hf() to obtain checkpoint-format keys (including the *_packed/*_scale/*_shape keys for INT4), then from_hf() to dequantize.
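
A minimal loading sketch; the checkpoint path is hypothetical, and extra model_args/kwargs are forwarded per the signature above:

```python
from nemo_automodel.components.models.kimi_k25_vl.model import KimiK25VLForConditionalGeneration

# Hypothetical path; per the docstring above, this builds the model structure while
# weight loading is handled by DCP through the state_dict_adapter.
model = KimiK25VLForConditionalGeneration.from_pretrained("/path/to/kimi-k25-vl-checkpoint")
print(model.dtype)
```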

property dtype#
get_input_embeddings()#
set_input_embeddings(value)#
get_output_embeddings()#
set_output_embeddings(new_embeddings)#
property lm_head#
forward(
input_ids=None,
attention_mask=None,
position_ids=None,
past_key_values=None,
inputs_embeds=None,
labels=None,
use_cache=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
pixel_values=None,
grid_thws=None,
padding_mask=None,
target_seq_length=None,
**kwargs,
)#
initialize_weights(buffer_device=None, dtype=torch.bfloat16)#
nemo_automodel.components.models.kimi_k25_vl.model.ModelClass#

None

nemo_automodel.components.models.kimi_k25_vl.model._register_kimi_k25_vl_with_transformers()#

Register KimiK25VLConfig and model with transformers Auto classes.

nemo_automodel.components.models.kimi_k25_vl.model.compute_expanded_seq_length(
text_seq_length: int,
grid_thws: torch.Tensor,
merge_kernel_size: Tuple[int, int] = (2, 2),
num_images: int = 1,
) int#

Compute the expanded sequence length after image token insertion.

For pipeline parallelism, this can be used to pre-compute the target_seq_length parameter needed for fixed-shape outputs.

Parameters:
  • text_seq_length – Original text sequence length (including 1 placeholder per image)

  • grid_thws – Tensor of shape (num_images, 3) with [t, h, w] per image

  • merge_kernel_size – Vision tower’s patch merge kernel size, default (2, 2)

  • num_images – Number of images (placeholders) in the sequence

Returns:

Expected sequence length after image features are inserted

Example

For 1 image per sample with grid_thws = [[1, 28, 28]]:
num_image_tokens = (28 // 2) * (28 // 2) = 196
expanded_length = text_seq_length - 1 + 196

>>> grid_thws = torch.tensor([[1, 28, 28]])
>>> compute_expanded_seq_length(82, grid_thws)
277  # 82 - 1 + 196