nemo_automodel.components.models.kimi_k25_vl.model#

KimiK25VL model with backend-aware DeepseekV3 language model.

This is a self-contained implementation that includes all necessary components:

  • Configuration classes

  • Vision tower (MoonViT3d with temporal dimension)

  • Multi-modal projector (PatchMergerMLP)

  • Language model backend (DeepseekV3)

Module Contents#

Classes#

MoonViT3dConfig

Configuration for MoonViT3d vision encoder with temporal support.

KimiK25VLConfig

Configuration for KimiK25VL model.

Learnable2DInterpPosEmbDividedFixed

Learnable 2D interpolatable position embedding with fixed temporal sincos embedding.

Rope2DPosEmbRepeated

2D rotary position embedding repeated for temporal dimension.

MoonViT3dMLP

MLP for MoonViT3d.

MoonViT3dEncoderLayer

Single encoder layer for MoonViT3d.

MoonViT3dEncoder

MoonViT3d encoder with temporal support.

MoonVision3dPatchEmbed

Patch embedding for MoonViT3d.

MoonViT3dPretrainedModel

MoonViT3d vision encoder with temporal support.

KimiK25VLMultiModalProjector

Projects vision features to language model dimension using patch merger MLP.

DeepSeekV3RotaryEmbeddingAdapter

Callable adapter that wraps DeepseekV3’s freqs_cis-based RoPE.

KimiK25VLLanguageModelBackend

Backend-aware language model wrapper using DeepseekV3 architecture.

KimiK25VLModel

KimiK25VL multimodal backbone with a DeepseekV3 text decoder.

KimiK25VLForConditionalGeneration

KimiK25VL model with backend-aware DeepseekV3 language model.

Functions#

get_1d_sincos_pos_embed_from_grid

Generate 1D sinusoidal positional embedding from grid positions.

get_1d_sincos_pos_embed

Generate 1D sinusoidal positional embedding for temporal dimension.

_apply_rope_vision

Apply rotary position embedding for vision.

vision_attention_flash

Flash attention for vision.

vision_attention_sdpa

SDPA attention for vision.

tpool_patch_merger

Merge patches with temporal pooling.

_register_kimi_k25_vl_with_transformers

Register KimiK25VLConfig and model with transformers Auto classes.

compute_expanded_seq_length

Compute the expanded sequence length after image token insertion.

Data#

API#

nemo_automodel.components.models.kimi_k25_vl.model.LOGGER#

‘getLogger(…)’

class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dConfig(
patch_size: int = 14,
init_pos_emb_height: int = 64,
init_pos_emb_width: int = 64,
init_pos_emb_time: int = 4,
pos_emb_type: str = 'divided_fixed',
num_attention_heads: int = 16,
num_hidden_layers: int = 27,
hidden_size: int = 1152,
intermediate_size: int = 4304,
merge_kernel_size: Tuple[int, int] = (2, 2),
video_attn_type: str = 'spatial_temporal',
merge_type: str = 'sd2_tpool',
**kwargs,
)#

Bases: transformers.configuration_utils.PretrainedConfig

Configuration for MoonViT3d vision encoder with temporal support.

Initialization

model_type#

‘moonvit3d’

class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLConfig(
vision_config: Optional[Union[Dict, nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dConfig]] = None,
text_config: Optional[Union[Dict, transformers.models.deepseek_v3.configuration_deepseek_v3.DeepseekV3Config]] = None,
ignore_index: int = -100,
media_placeholder_token_id: int = 163605,
pad_token_id: int = 0,
tie_word_embeddings: bool = False,
mm_projector_type: str = 'patchmerger',
mm_hidden_size: Optional[int] = None,
projector_hidden_act: str = 'gelu',
projector_ln_eps: float = 1e-05,
architectures: Optional[List[str]] = None,
**kwargs,
)#

Bases: transformers.configuration_utils.PretrainedConfig

Configuration for KimiK25VL model.

Supports both ‘kimi_k25_vl’ and ‘kimi_k25’ model types for compatibility with original checkpoints.

Initialization

model_type#

‘kimi_k25’

to_dict() Dict[str, Any]#
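
A minimal construction sketch (values are illustrative; any field not shown falls back to the documented defaults). The vision config may be passed either as a MoonViT3dConfig instance or as a plain dict:

```python
from nemo_automodel.components.models.kimi_k25_vl.model import KimiK25VLConfig

# Illustrative values only; vision_config is accepted as a dict per the signature above.
cfg = KimiK25VLConfig(
    vision_config={"patch_size": 14, "hidden_size": 1152, "num_hidden_layers": 27},
    media_placeholder_token_id=163605,
)
print(cfg.model_type)         # 'kimi_k25'
print(sorted(cfg.to_dict()))  # serialized config keys
```
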
nemo_automodel.components.models.kimi_k25_vl.model.get_1d_sincos_pos_embed_from_grid(
embed_dim: int,
pos: numpy.ndarray,
) numpy.ndarray#

Generate 1D sinusoidal positional embedding from grid positions.

nemo_automodel.components.models.kimi_k25_vl.model.get_1d_sincos_pos_embed(
embed_dim: int,
t_size: int,
cls_token: bool = False,
) numpy.ndarray#

Generate 1D sinusoidal positional embedding for temporal dimension.

nemo_automodel.components.models.kimi_k25_vl.model._apply_rope_vision(
xq: torch.Tensor,
xk: torch.Tensor,
freqs_cis: torch.Tensor,
) Tuple[torch.Tensor, torch.Tensor]#

Apply rotary position embedding for vision.

nemo_automodel.components.models.kimi_k25_vl.model.vision_attention_flash(
q,
k,
v,
q_cu_seqlens,
k_cu_seqlens,
max_seqlen_q=None,
max_seqlen_k=None,
)#

Flash attention for vision.

nemo_automodel.components.models.kimi_k25_vl.model.vision_attention_sdpa(q, k, v, q_cu_seqlens, k_cu_seqlens, **kwargs)#

SDPA attention for vision.

class nemo_automodel.components.models.kimi_k25_vl.model.Learnable2DInterpPosEmbDividedFixed(
height: int,
width: int,
num_frames: int,
dim: int,
interpolation_mode: str = 'bicubic',
)#

Bases: torch.nn.Module

Learnable 2D interpolatable position embedding with fixed temporal sincos embedding.

Initialization

forward(x: torch.Tensor, grid_thws: torch.Tensor) torch.Tensor#
class nemo_automodel.components.models.kimi_k25_vl.model.Rope2DPosEmbRepeated(
dim: int,
max_height: int,
max_width: int,
theta_base: float = 10000,
)#

Bases: torch.nn.Module

2D rotary position embedding repeated for temporal dimension.

Initialization

_precompute_freqs_cis(device: torch.device) torch.Tensor#
get_freqs_cis(
grid_thws: torch.Tensor,
device: torch.device,
) torch.Tensor#
class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dMLP(dims: List[int], activation, bias: bool = True)#

Bases: torch.nn.Module

MLP for MoonViT3d.

Initialization

forward(x: torch.Tensor) torch.Tensor#
class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dEncoderLayer(
num_heads: int,
hidden_dim: int,
mlp_dim: int,
*,
activation=F.gelu,
attn_bias: bool = False,
attn_implementation: str = 'flash_attention_2',
)#

Bases: torch.nn.Module

Single encoder layer for MoonViT3d.

Initialization

forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
max_seqlen: int,
rope_freqs_cis: torch.Tensor,
) torch.Tensor#
class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dEncoder(hidden_dim: int, num_layers: int, block_cfg: dict)#

Bases: torch.nn.Module

MoonViT3d encoder with temporal support.

Initialization

forward(
hidden_states: torch.Tensor,
grid_thws: torch.Tensor,
) torch.Tensor#
class nemo_automodel.components.models.kimi_k25_vl.model.MoonVision3dPatchEmbed(
out_dim: int,
in_dim: int = 3,
patch_size: int = 14,
pos_emb_height: int = 64,
pos_emb_width: int = 64,
pos_emb_time: int = 4,
)#

Bases: torch.nn.Module

Patch embedding for MoonViT3d.

Initialization

forward(x: torch.Tensor, grid_thws: torch.Tensor) torch.Tensor#
nemo_automodel.components.models.kimi_k25_vl.model.tpool_patch_merger(
x: torch.Tensor,
grid_thws: torch.Tensor,
merge_kernel_size: List[int] = [2, 2],
) List[torch.Tensor]#

Merge patches with temporal pooling.

class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dPretrainedModel(config)#

Bases: torch.nn.Module

MoonViT3d vision encoder with temporal support.

Initialization

property dtype#
forward(
pixel_values: torch.Tensor,
grid_thws: torch.Tensor,
) List[torch.Tensor]#
class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLMultiModalProjector(config)#

Bases: torch.nn.Module

Projects vision features to language model dimension using patch merger MLP.

Initialization

forward(
image_features: List[torch.Tensor],
) List[torch.Tensor]#
class nemo_automodel.components.models.kimi_k25_vl.model.DeepSeekV3RotaryEmbeddingAdapter(
parent_module: torch.nn.Module,
rope_fusion: bool = False,
)#

Callable adapter that wraps DeepseekV3’s freqs_cis-based RoPE.

Initialization

property freqs_cis#
__call__(
hidden_states: torch.Tensor,
position_ids: torch.Tensor,
) torch.Tensor#
class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLLanguageModelBackend(
config,
backend: nemo_automodel.components.models.common.BackendConfig,
*,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
)#

Bases: torch.nn.Module

Backend-aware language model wrapper using DeepseekV3 architecture.

Initialization

get_input_embeddings()#
set_input_embeddings(value)#
forward(
input_ids=None,
*,
inputs_embeds=None,
attention_mask=None,
position_ids=None,
padding_mask=None,
**kwargs,
)#
init_weights(buffer_device=None)#
property embed_tokens#
property layers#
property norm#
class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLModel(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
)#

Bases: torch.nn.Module

KimiK25VL multimodal backbone with a DeepseekV3 text decoder.

Initialization

property layers#
property embed_tokens#
property norm#
_compute_num_image_tokens_from_grid(
grid_thws: torch.Tensor,
) List[int]#

Pre-compute number of image tokens from grid_thws without running vision tower.

For 1 image per sample: num_tokens = (h // merge_h) * (w // merge_w). With the default merge_kernel_size=(2, 2) this becomes num_tokens = (h // 2) * (w // 2); see the sketch below.

Parameters:

grid_thws – Tensor of shape (batch_size, 3) with [t, h, w] per sample

Returns:

List of expected image token counts per sample
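
A short worked example of this arithmetic (illustrative, not the module's implementation), assuming the default merge_kernel_size=(2, 2):

```python
import torch

# grid_thws holds [t, h, w] per sample; with merge_kernel_size=(2, 2) each sample
# contributes (h // 2) * (w // 2) image tokens.
grid_thws = torch.tensor([[1, 28, 28], [1, 56, 42]])
merge_h, merge_w = 2, 2
num_tokens = [(h // merge_h) * (w // merge_w) for _, h, w in grid_thws.tolist()]
print(num_tokens)  # [196, 588]
```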

_merge_input_ids_with_image_features(
image_features: List[torch.Tensor],
inputs_embeds: torch.Tensor,
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
labels: Optional[torch.Tensor] = None,
target_seq_length: Optional[int] = None,
)#

Merge image features into input embeddings.

Supports two modes:

  1. Pre-expanded (PP mode): input_ids already contains N placeholder tokens per image, where N is the number of image features; a simple 1:1 replacement is performed.

  2. Dynamic expansion: input_ids contains a single placeholder per image, which is expanded to N tokens (see the sketch after the parameter list).

Parameters:
  • image_features – List of image feature tensors, one per image

  • inputs_embeds – Text embeddings (batch_size, seq_len, embed_dim)

  • input_ids – Token IDs (batch_size, seq_len)

  • attention_mask – Attention mask (batch_size, seq_len)

  • labels – Optional labels for training

  • target_seq_length – Optional fixed output length for pipeline parallelism.
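
A conceptual sketch of the dynamic-expansion mode (toy shapes, not the module's implementation): the single placeholder position is replaced by the N projected image features, so the sequence grows by N - 1.

```python
import torch

placeholder_id = 163605             # media_placeholder_token_id from KimiK25VLConfig
input_ids = torch.tensor([1, 5, placeholder_id, 7, 2])
inputs_embeds = torch.randn(5, 8)   # (seq_len, embed_dim); embed_dim=8 for illustration
image_features = torch.randn(4, 8)  # N=4 projected image tokens for one image

pos = int((input_ids == placeholder_id).nonzero()[0])
merged = torch.cat([inputs_embeds[:pos], image_features, inputs_embeds[pos + 1:]], dim=0)
print(merged.shape)                 # torch.Size([8, 8]) -> 5 - 1 + 4 tokens
```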

_extract_image_features(pixel_values, grid_thws)#

Extract and project image features.

forward(
input_ids=None,
attention_mask=None,
position_ids=None,
inputs_embeds=None,
pixel_values=None,
grid_thws=None,
labels=None,
padding_mask=None,
target_seq_length=None,
**kwargs,
)#

Forward pass with optional fixed sequence length for pipeline parallelism.

Parameters:

target_seq_length – If provided, the output after image token expansion is padded to this fixed length. Required for PP with varying image sizes. It can be pre-computed as max_text_len - 1 + max_image_tokens, where max_image_tokens = (h // 2) * (w // 2) for each image (see the sketch below).
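
One way to pre-compute this value for a batch is sketched below, using compute_expanded_seq_length from this module (batch contents are illustrative):

```python
import torch

from nemo_automodel.components.models.kimi_k25_vl.model import compute_expanded_seq_length

# One image per sample; take the maximum expanded length across the batch so every
# pipeline stage sees the same fixed shape.
samples = [
    (82, torch.tensor([[1, 28, 28]])),  # (text_seq_length, grid_thws)
    (64, torch.tensor([[1, 56, 42]])),
]
target_seq_length = max(compute_expanded_seq_length(t, g) for t, g in samples)
print(target_seq_length)  # max(82 - 1 + 196, 64 - 1 + 588) = 651
```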

class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLForConditionalGeneration(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin

KimiK25VL model with backend-aware DeepseekV3 language model.

Initialization

config_class#

None

base_model_prefix#

‘model’

main_input_name#

‘pixel_values’

_no_split_modules#

[‘MoonViT3dEncoderLayer’]

supports_gradient_checkpointing#

True

classmethod from_config(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#
classmethod from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
**kwargs,
)#

Load model from pretrained path.

Creates the model structure only. Weights are loaded by DCP, which calls state_dict_adapter.to_hf() to obtain checkpoint-format keys (including the *_packed/*_scale/*_shape keys for INT4), then from_hf() to dequantize.
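
A minimal loading sketch; the checkpoint path is hypothetical, and extra model_args/kwargs are forwarded per the signature above:

```python
from nemo_automodel.components.models.kimi_k25_vl.model import KimiK25VLForConditionalGeneration

# Hypothetical path; per the docstring above, this builds the model structure while
# weight loading is handled by DCP through the state_dict_adapter.
model = KimiK25VLForConditionalGeneration.from_pretrained("/path/to/kimi-k25-vl-checkpoint")
print(model.dtype)
```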

property dtype#
get_input_embeddings()#
set_input_embeddings(value)#
get_output_embeddings()#
set_output_embeddings(new_embeddings)#
property lm_head#
forward(
input_ids=None,
attention_mask=None,
position_ids=None,
past_key_values=None,
inputs_embeds=None,
labels=None,
use_cache=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
pixel_values=None,
grid_thws=None,
padding_mask=None,
target_seq_length=None,
**kwargs,
)#
initialize_weights(buffer_device=None, dtype=torch.bfloat16)#
nemo_automodel.components.models.kimi_k25_vl.model.ModelClass#

None

nemo_automodel.components.models.kimi_k25_vl.model._register_kimi_k25_vl_with_transformers()#

Register KimiK25VLConfig and model with transformers Auto classes.

nemo_automodel.components.models.kimi_k25_vl.model.compute_expanded_seq_length(
text_seq_length: int,
grid_thws: torch.Tensor,
merge_kernel_size: Tuple[int, int] = (2, 2),
num_images: int = 1,
) int#

Compute the expanded sequence length after image token insertion.

For pipeline parallelism, this can be used to pre-compute the target_seq_length parameter needed for fixed-shape outputs.

Parameters:
  • text_seq_length – Original text sequence length (including 1 placeholder per image)

  • grid_thws – Tensor of shape (num_images, 3) with [t, h, w] per image

  • merge_kernel_size – Vision tower’s patch merge kernel size, default (2, 2)

  • num_images – Number of images (placeholders) in the sequence

Returns:

Expected sequence length after image features are inserted

Example

For 1 image per sample with grid_thws = [[1, 28, 28]]:
num_image_tokens = (28 // 2) * (28 // 2) = 196
expanded_length = text_seq_length - 1 + 196

>>> grid_thws = torch.tensor([[1, 28, 28]])
>>> compute_expanded_seq_length(82, grid_thws)
277  # 82 - 1 + 196