nemo_automodel.components.models.kimi_k25_vl.model


KimiK25VL model with backend-aware DeepseekV3 language model.

This is a self-contained implementation that includes all necessary components:

  • Configuration classes
  • Vision tower (MoonViT3d with temporal dimension)
  • Multi-modal projector (PatchMergerMLP)
  • Language model backend (DeepseekV3)

Module Contents

Classes

| Name | Description |
| --- | --- |
| DeepSeekV3RotaryEmbeddingAdapter | Callable adapter that wraps DeepseekV3’s freqs_cis-based RoPE. |
| KimiK25VLConfig | Configuration for KimiK25VL model. |
| KimiK25VLForConditionalGeneration | KimiK25VL model with backend-aware DeepseekV3 language model. |
| KimiK25VLLanguageModelBackend | Backend-aware language model wrapper using DeepseekV3 architecture. |
| KimiK25VLModel | KimiK25VL multimodal backbone with a DeepseekV3 text decoder. |
| KimiK25VLMultiModalProjector | Projects vision features to language model dimension using patch merger MLP. |
| Learnable2DInterpPosEmbDividedFixed | Learnable 2D interpolatable position embedding with fixed temporal sincos embedding. |
| MoonViT3dConfig | Configuration for MoonViT3d vision encoder with temporal support. |
| MoonViT3dEncoder | MoonViT3d encoder with temporal support. |
| MoonViT3dEncoderLayer | Single encoder layer for MoonViT3d. |
| MoonViT3dMLP | MLP for MoonViT3d. |
| MoonViT3dPretrainedModel | MoonViT3d vision encoder with temporal support. |
| MoonVision3dPatchEmbed | Patch embedding for MoonViT3d. |
| Rope2DPosEmbRepeated | 2D rotary position embedding repeated for temporal dimension. |

Functions

| Name | Description |
| --- | --- |
| _apply_rope_vision | Apply rotary position embedding for vision. |
| _register_kimi_k25_vl_with_transformers | Register KimiK25VLConfig and model with transformers Auto classes. |
| compute_expanded_seq_length | Compute the expanded sequence length after image token insertion. |
| get_1d_sincos_pos_embed | Generate 1D sinusoidal positional embedding for temporal dimension. |
| get_1d_sincos_pos_embed_from_grid | Generate 1D sinusoidal positional embedding from grid positions. |
| tpool_patch_merger | Merge patches with temporal pooling. |
| vision_attention_flash | Flash attention for vision. |
| vision_attention_sdpa | SDPA attention for vision. |

Data

FLASH_ATTN_AVAILABLE

LOGGER

ModelClass

API

class nemo_automodel.components.models.kimi_k25_vl.model.DeepSeekV3RotaryEmbeddingAdapter(
parent_module: torch.nn.Module,
rope_fusion: bool = False
)

Callable adapter that wraps DeepseekV3’s freqs_cis-based RoPE.

nemo_automodel.components.models.kimi_k25_vl.model.DeepSeekV3RotaryEmbeddingAdapter.__call__(
hidden_states: torch.Tensor,
position_ids: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLConfig(
vision_config: typing.Optional[typing.Union[typing.Dict, nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dConfig]] = None,
text_config: typing.Optional[typing.Union[typing.Dict, transformers.models.deepseek_v3.configuration_deepseek_v3.DeepseekV3Config]] = None,
ignore_index: int = -100,
media_placeholder_token_id: int = 163605,
pad_token_id: int = 0,
tie_word_embeddings: bool = False,
mm_projector_type: str = 'patchmerger',
mm_hidden_size: typing.Optional[int] = None,
projector_hidden_act: str = 'gelu',
projector_ln_eps: float = 1e-05,
architectures: typing.Optional[typing.List[str]] = None,
kwargs = {}
)

Bases: PretrainedConfig

Configuration for KimiK25VL model.

Supports both ‘kimi_k25_vl’ and ‘kimi_k25’ model types for compatibility with original checkpoints.

mm_hidden_size
model_type
= 'kimi_k25'
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLConfig.to_dict() -> typing.Dict[str, typing.Any]
class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLForConditionalGeneration(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs = {}
)

Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin

KimiK25VL model with backend-aware DeepseekV3 language model.

_keep_in_fp32_modules
= ['freqs_cis', 'rotary_emb']
_no_split_modules
= ['MoonViT3dEncoderLayer']
_pp_keep_self_forward
bool = True
backend
= backend or BackendConfig()
base_model_prefix
= 'model'
main_input_name
= 'pixel_values'
media_placeholder_token_id
= config.media_placeholder_token_id
model
moe_config
= self.model.moe_config
pad_token_id
state_dict_adapter
vocab_size
= config.text_config.vocab_size
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLForConditionalGeneration.forward(
input_ids = None,
attention_mask = None,
position_ids = None,
past_key_values = None,
inputs_embeds = None,
labels = None,
use_cache = None,
output_attentions = None,
output_hidden_states: typing.Optional[bool] = None,
return_dict = None,
pixel_values = None,
grid_thws = None,
padding_mask = None,
target_seq_length = None,
logits_to_keep: typing.Union[int, torch.Tensor] = 0,
kwargs = {}
)
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLForConditionalGeneration.from_config(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs = {}
)
classmethod
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLForConditionalGeneration.from_pretrained(
pretrained_model_name_or_path: str,
model_args = (),
kwargs = {}
)
classmethod

Load model from pretrained path.

Creates the model structure. Weights are loaded by DCP, which calls state_dict_adapter.to_hf() to get checkpoint-format keys (including _packed/_scale/*_shape for INT4) and then from_hf() to dequantize.

nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLForConditionalGeneration.get_input_embeddings()
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLForConditionalGeneration.get_output_embeddings()
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLForConditionalGeneration.initialize_weights(
buffer_device = None,
dtype = torch.bfloat16
)
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLForConditionalGeneration.set_input_embeddings(
value
)
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLForConditionalGeneration.set_output_embeddings(
new_embeddings
)
class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLLanguageModelBackend(
config,
backend: nemo_automodel.components.models.common.BackendConfig,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None
)

Bases: Module

Backend-aware language model wrapper using DeepseekV3 architecture.

model
moe_config
= self.model.moe_config
rotary_emb
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLLanguageModelBackend.forward(
input_ids = None,
inputs_embeds = None,
attention_mask = None,
position_ids = None,
padding_mask = None,
kwargs = {}
)
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLLanguageModelBackend.get_input_embeddings()
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLLanguageModelBackend.init_weights(
buffer_device = None
)
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLLanguageModelBackend.set_input_embeddings(
value
)
class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLModel(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None
)

Bases: Module

KimiK25VL multimodal backbone with a DeepseekV3 text decoder.

backend
= backend or BackendConfig()
language_model
media_placeholder_token_id
= config.media_placeholder_token_id
moe_config
= self.language_model.moe_config
multi_modal_projector
= KimiK25VLMultiModalProjector(config)
vision_tower
= MoonViT3dPretrainedModel(config.vision_config)
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLModel._compute_num_image_tokens_from_grid(
grid_thws: torch.Tensor
) -> typing.List[int]

Pre-compute number of image tokens from grid_thws without running vision tower.

For 1 image per sample: num_tokens = (h // merge_h) * (w // merge_w).
With the default merge_kernel_size=(2, 2): num_tokens = (h // 2) * (w // 2).

Parameters:

grid_thws
torch.Tensor

Tensor of shape (batch_size, 3) with [t, h, w] per sample

Returns: List[int]

List of expected image token counts per sample
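
The formula above can be checked by hand with a minimal sketch. It uses plain Python tuples in place of a torch.Tensor, and the helper name num_image_tokens_from_grid is hypothetical, not part of this module.

```python
from typing import List, Sequence, Tuple


def num_image_tokens_from_grid(
    grid_thws: Sequence[Tuple[int, int, int]],
    merge_kernel_size: Tuple[int, int] = (2, 2),
) -> List[int]:
    """Return (h // merge_h) * (w // merge_w) for each [t, h, w] row."""
    merge_h, merge_w = merge_kernel_size
    # t is unused: the documented formula depends only on h and w.
    return [(h // merge_h) * (w // merge_w) for _t, h, w in grid_thws]


# Two samples with one image each, given as [t, h, w] in patch units.
counts = num_image_tokens_from_grid([(1, 28, 28), (1, 16, 32)])
```

With the default (2, 2) kernel, a 28 x 28 patch grid yields 14 * 14 tokens and a 16 x 32 grid yields 8 * 16 tokens.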

nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLModel._extract_image_features(
pixel_values,
grid_thws
)

Extract and project image features.

nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLModel._merge_input_ids_with_image_features(
image_features: typing.List[torch.Tensor],
inputs_embeds: torch.Tensor,
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
labels: typing.Optional[torch.Tensor] = None,
target_seq_length: typing.Optional[int] = None
)

Merge image features into input embeddings.

Supports two modes:

  1. Pre-expanded (PP mode): input_ids already has N placeholder tokens per image, where N = number of image features. Does simple 1:1 replacement.
  2. Dynamic expansion: input_ids has 1 placeholder per image, expands to N tokens.

Parameters:

image_features
List[torch.Tensor]

List of image feature tensors, one per image

inputs_embeds
torch.Tensor

Text embeddings (batch_size, seq_len, embed_dim)

input_ids
torch.Tensor

Token IDs (batch_size, seq_len)

attention_mask
torch.Tensor

Attention mask (batch_size, seq_len)

labels
Optional[torch.Tensor], defaults to None

Optional labels for training

target_seq_length
Optional[int], defaults to None

Optional fixed output length for pipeline parallelism.
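
The two modes can be illustrated on token IDs alone. The sketch below is a simplification under stated assumptions: the real method also rewrites embeddings, the attention mask, and labels, and the helper names expand_placeholders and is_pre_expanded are hypothetical. The placeholder value is the default media_placeholder_token_id from KimiK25VLConfig.

```python
from typing import List

PLACEHOLDER = 163605  # default media_placeholder_token_id in KimiK25VLConfig


def expand_placeholders(input_ids: List[int], tokens_per_image: List[int]) -> List[int]:
    """Dynamic expansion: each single placeholder becomes N placeholders."""
    out: List[int] = []
    image_idx = 0
    for tok in input_ids:
        if tok == PLACEHOLDER:
            out.extend([PLACEHOLDER] * tokens_per_image[image_idx])
            image_idx += 1
        else:
            out.append(tok)
    return out


def is_pre_expanded(input_ids: List[int], tokens_per_image: List[int]) -> bool:
    """Pre-expanded mode: placeholder count already equals the feature count."""
    return input_ids.count(PLACEHOLDER) == sum(tokens_per_image)


ids = [1, PLACEHOLDER, 2, 3]             # one image, one placeholder
expanded = expand_placeholders(ids, [4])  # the image contributes 4 features
```

After expansion every placeholder position lines up with exactly one image feature, so the pre-expanded case reduces to a 1:1 replacement.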

nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLModel.forward(
input_ids = None,
attention_mask = None,
position_ids = None,
inputs_embeds = None,
pixel_values = None,
grid_thws = None,
labels = None,
padding_mask = None,
target_seq_length = None,
kwargs = {}
)

Forward pass with optional fixed sequence length for pipeline parallelism.

Parameters:

target_seq_length
Defaults to None

If provided, the output after image token expansion is padded to this fixed length. Required for PP when image sizes vary. It can be pre-computed as max_text_len - 1 + max_image_tokens, where max_image_tokens = (h // 2) * (w // 2) for each image (default merge kernel).
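
As a rough illustration of that pre-computation, the sketch below picks one fixed length for a batch. It assumes one image per sample and the default (2, 2) merge kernel; the helper name batch_target_seq_length is hypothetical.

```python
from typing import Sequence, Tuple


def batch_target_seq_length(
    text_lens: Sequence[int],
    grid_thws: Sequence[Tuple[int, int, int]],
    merge_kernel_size: Tuple[int, int] = (2, 2),
) -> int:
    """max_text_len - 1 + max_image_tokens, assuming one image per sample."""
    merge_h, merge_w = merge_kernel_size
    max_text_len = max(text_lens)
    max_image_tokens = max(
        (h // merge_h) * (w // merge_w) for _t, h, w in grid_thws
    )
    # The single placeholder is replaced by the image tokens, hence the -1.
    return max_text_len - 1 + max_image_tokens


target = batch_target_seq_length([10, 12], [(1, 28, 28), (1, 16, 32)])
```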

class nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLMultiModalProjector(
config
)

Bases: Module

Projects vision features to language model dimension using patch merger MLP.

act
= GELUActivation()
hidden_size
linear_1
linear_2
pre_norm
nemo_automodel.components.models.kimi_k25_vl.model.KimiK25VLMultiModalProjector.forward(
image_features: typing.List[torch.Tensor]
) -> typing.List[torch.Tensor]
class nemo_automodel.components.models.kimi_k25_vl.model.Learnable2DInterpPosEmbDividedFixed(
height: int,
width: int,
num_frames: int,
dim: int,
interpolation_mode: str = 'bicubic'
)

Bases: Module

Learnable 2D interpolatable position embedding with fixed temporal sincos embedding.

weight
= nn.Parameter(torch.empty(height, width, dim))
nemo_automodel.components.models.kimi_k25_vl.model.Learnable2DInterpPosEmbDividedFixed.forward(
x: torch.Tensor,
grid_thws: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dConfig(
patch_size: int = 14,
init_pos_emb_height: int = 64,
init_pos_emb_width: int = 64,
init_pos_emb_time: int = 4,
pos_emb_type: str = 'divided_fixed',
num_attention_heads: int = 16,
num_hidden_layers: int = 27,
hidden_size: int = 1152,
intermediate_size: int = 4304,
merge_kernel_size: typing.Tuple[int, int] = (2, 2),
video_attn_type: str = 'spatial_temporal',
merge_type: str = 'sd2_tpool',
kwargs = {}
)

Bases: PretrainedConfig

Configuration for MoonViT3d vision encoder with temporal support.

merge_kernel_size
model_type
= 'moonvit3d'
class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dEncoder(
hidden_dim: int,
num_layers: int,
block_cfg: dict
)

Bases: Module

MoonViT3d encoder with temporal support.

blocks
final_layernorm
= nn.LayerNorm(hidden_dim)
rope_2d
nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dEncoder.forward(
hidden_states: torch.Tensor,
grid_thws: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dEncoderLayer(
num_heads: int,
hidden_dim: int,
mlp_dim: int,
activation = F.gelu,
attn_bias: bool = False,
attn_implementation: str = 'flash_attention_2'
)

Bases: Module

Single encoder layer for MoonViT3d.

head_dim
= hidden_dim // num_heads
mlp
norm0
= nn.LayerNorm(hidden_dim)
norm1
= nn.LayerNorm(hidden_dim)
wo
= nn.Linear(hidden_dim, hidden_dim, bias=attn_bias)
wqkv
nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dEncoderLayer.forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
max_seqlen: int,
rope_freqs_cis: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dMLP(
dims: typing.List[int],
activation,
bias: bool = True
)

Bases: Module

MLP for MoonViT3d.

fc0
= nn.Linear(dims[0], dims[1], bias=bias)
fc1
= nn.Linear(dims[1], dims[2], bias=bias)
nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dMLP.forward(
x: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dPretrainedModel(
config
)

Bases: Module

MoonViT3d vision encoder with temporal support.

encoder
merge_kernel_size
= config.merge_kernel_size
merge_type
= config.merge_type
patch_embed
nemo_automodel.components.models.kimi_k25_vl.model.MoonViT3dPretrainedModel.forward(
pixel_values: torch.Tensor,
grid_thws: torch.Tensor
) -> typing.List[torch.Tensor]
class nemo_automodel.components.models.kimi_k25_vl.model.MoonVision3dPatchEmbed(
out_dim: int,
in_dim: int = 3,
patch_size: int = 14,
pos_emb_height: int = 64,
pos_emb_width: int = 64,
pos_emb_time: int = 4
)

Bases: Module

Patch embedding for MoonViT3d.

patch_size
pos_emb
proj
nemo_automodel.components.models.kimi_k25_vl.model.MoonVision3dPatchEmbed.forward(
x: torch.Tensor,
grid_thws: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.kimi_k25_vl.model.Rope2DPosEmbRepeated(
dim: int,
max_height: int,
max_width: int,
theta_base: float = 10000
)

Bases: Module

2D rotary position embedding repeated for temporal dimension.

nemo_automodel.components.models.kimi_k25_vl.model.Rope2DPosEmbRepeated._precompute_freqs_cis(
device: torch.device
) -> torch.Tensor
nemo_automodel.components.models.kimi_k25_vl.model.Rope2DPosEmbRepeated.get_freqs_cis(
grid_thws: torch.Tensor,
device: torch.device
) -> torch.Tensor
nemo_automodel.components.models.kimi_k25_vl.model._apply_rope_vision(
xq: torch.Tensor,
xk: torch.Tensor,
freqs_cis: torch.Tensor
) -> typing.Tuple[torch.Tensor, torch.Tensor]

Apply rotary position embedding for vision.
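
For intuition, freqs_cis-style RoPE is commonly applied by treating feature pairs as complex numbers and multiplying by a unit-magnitude complex factor. The sketch below shows that idea in pure Python; pairing consecutive features and the helper name apply_rope_pairs are assumptions, not a description of this function's internals.

```python
import cmath
from typing import List


def apply_rope_pairs(x: List[float], angles: List[float]) -> List[float]:
    """Rotate each consecutive pair (x[2i], x[2i+1]) by angles[i] radians."""
    assert len(x) == 2 * len(angles)
    out: List[float] = []
    for i, theta in enumerate(angles):
        # exp(i * theta) has magnitude 1, so the rotation preserves the norm.
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * theta)
        out.extend([z.real, z.imag])
    return out


rotated = apply_rope_pairs([1.0, 0.0, 0.0, 2.0], [cmath.pi / 2, cmath.pi])
```

The first pair is rotated by a quarter turn and the second by a half turn; the vector norm is unchanged.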

nemo_automodel.components.models.kimi_k25_vl.model._register_kimi_k25_vl_with_transformers()

Register KimiK25VLConfig and model with transformers Auto classes.

nemo_automodel.components.models.kimi_k25_vl.model.compute_expanded_seq_length(
text_seq_length: int,
grid_thws: torch.Tensor,
merge_kernel_size: typing.Tuple[int, int] = (2, 2),
num_images: int = 1
) -> int

Compute the expanded sequence length after image token insertion.

For pipeline parallelism, this can be used to pre-compute the target_seq_length parameter needed for fixed-shape outputs.

Parameters:

text_seq_length
int

Original text sequence length (including 1 placeholder per image)

grid_thws
torch.Tensor

Tensor of shape (num_images, 3) with [t, h, w] per image

merge_kernel_size
Tuple[int, int], defaults to (2, 2)

Vision tower’s patch merge kernel size, default (2, 2)

num_images
int, defaults to 1

Number of images (placeholders) in the sequence

Returns: int

Expected sequence length after image features are inserted
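
A minimal sketch of the arithmetic implied by these parameters follows. It assumes each image's single placeholder is replaced by (h // merge_h) * (w // merge_w) tokens; the exact implementation may differ, and expanded_seq_length_sketch is a hypothetical name that works on plain tuples rather than a torch.Tensor.

```python
from typing import Sequence, Tuple


def expanded_seq_length_sketch(
    text_seq_length: int,
    grid_thws: Sequence[Tuple[int, int, int]],
    merge_kernel_size: Tuple[int, int] = (2, 2),
    num_images: int = 1,
) -> int:
    """Text length minus the placeholders, plus the tokens of each image."""
    merge_h, merge_w = merge_kernel_size
    image_tokens = sum(
        (h // merge_h) * (w // merge_w) for _t, h, w in grid_thws[:num_images]
    )
    return text_seq_length - num_images + image_tokens


# Twenty text tokens containing two placeholders, one per image.
length = expanded_seq_length_sketch(
    20, [(1, 28, 28), (1, 16, 32)], num_images=2
)
```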

nemo_automodel.components.models.kimi_k25_vl.model.get_1d_sincos_pos_embed(
embed_dim: int,
t_size: int,
cls_token: bool = False
) -> numpy.ndarray

Generate 1D sinusoidal positional embedding for temporal dimension.

nemo_automodel.components.models.kimi_k25_vl.model.get_1d_sincos_pos_embed_from_grid(
embed_dim: int,
pos: numpy.ndarray
) -> numpy.ndarray

Generate 1D sinusoidal positional embedding from grid positions.
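
The two functions above follow a familiar pattern: one builds positions 0..t_size-1, the other maps positions to sine and cosine features. The sketch below uses the widely known MAE-style formulation in pure Python instead of NumPy; whether this module matches it term for term is an assumption, the cls_token option is omitted, and both helper names are hypothetical.

```python
import math
from typing import List, Sequence


def sincos_1d_from_positions(
    embed_dim: int, positions: Sequence[float]
) -> List[List[float]]:
    """Return one row per position: [sin(p * w_i)...] followed by [cos(p * w_i)...]."""
    assert embed_dim % 2 == 0, "embed_dim must be even"
    half = embed_dim // 2
    # Geometric frequency schedule with base 10000 (an assumption).
    omega = [1.0 / (10000.0 ** (i / half)) for i in range(half)]
    rows: List[List[float]] = []
    for p in positions:
        angles = [p * w for w in omega]
        rows.append([math.sin(a) for a in angles] + [math.cos(a) for a in angles])
    return rows


def sincos_1d(embed_dim: int, t_size: int) -> List[List[float]]:
    """Embed temporal positions 0, 1, ..., t_size - 1."""
    return sincos_1d_from_positions(embed_dim, range(t_size))


emb = sincos_1d(8, 4)  # 4 frames, 8 features each
```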

nemo_automodel.components.models.kimi_k25_vl.model.tpool_patch_merger(
x: torch.Tensor,
grid_thws: torch.Tensor,
merge_kernel_size: typing.List[int] | None = None
) -> typing.List[torch.Tensor]

Merge patches with temporal pooling.
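
One plausible reading of "temporal pooling" is sketched below in pure Python: for each image or video, features are averaged over the t frames and then grouped into merge_h x merge_w spatial blocks. This is an assumption about the behavior, not the module's code; it also assumes patches are laid out frame by frame in row-major order and that h and w are divisible by the kernel. The name tpool_merge_sketch is hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def tpool_merge_sketch(
    x: Sequence[Vector],
    grid_thws: Sequence[Tuple[int, int, int]],
    merge_kernel_size: Tuple[int, int] = (2, 2),
) -> List[List[List[Vector]]]:
    """Per item, return (h//kh * w//kw) groups of kh*kw frame-averaged vectors."""
    kh, kw = merge_kernel_size
    outputs = []
    offset = 0  # start of the current item inside the packed sequence x
    for t, h, w in grid_thws:
        merged = []
        for i in range(h // kh):
            for j in range(w // kw):
                group = []
                for ki in range(kh):
                    for kj in range(kw):
                        row, col = i * kh + ki, j * kw + kj
                        # Same spatial patch taken from every frame.
                        frames = [x[offset + f * h * w + row * w + col] for f in range(t)]
                        group.append([sum(vals) / t for vals in zip(*frames)])
                merged.append(group)
        outputs.append(merged)
        offset += t * h * w
    return outputs


# One video: 2 frames of a 2 x 2 patch grid, 1 feature per patch.
patches = [[0], [1], [2], [3], [10], [11], [12], [13]]
merged = tpool_merge_sketch(patches, [(2, 2, 2)])
```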

nemo_automodel.components.models.kimi_k25_vl.model.vision_attention_flash(
q,
k,
v,
q_cu_seqlens,
k_cu_seqlens,
max_seqlen_q = None,
max_seqlen_k = None
)

Flash attention for vision.

nemo_automodel.components.models.kimi_k25_vl.model.vision_attention_sdpa(
q,
k,
v,
q_cu_seqlens,
k_cu_seqlens,
kwargs = {}
)

SDPA attention for vision.
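
Both attention functions take cumulative sequence lengths (cu_seqlens) describing several images packed into one sequence. A common way to honor those boundaries without a variable-length kernel is a block-diagonal attention mask; the sketch below builds one in pure Python. It assumes self-attention (q_cu_seqlens equal to k_cu_seqlens), and block_diagonal_mask is a hypothetical helper, not part of this module.

```python
from typing import List


def block_diagonal_mask(cu_seqlens: List[int]) -> List[List[bool]]:
    """mask[q][k] is True only when q and k fall in the same packed sequence."""
    total = cu_seqlens[-1]
    mask = [[False] * total for _ in range(total)]
    # Consecutive entries of cu_seqlens delimit one sequence: [start, end).
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        for q in range(start, end):
            for k in range(start, end):
                mask[q][k] = True
    return mask


# Two packed images with 2 and 3 patches respectively.
mask = block_diagonal_mask([0, 2, 5])
```

Positions 0-1 attend only to each other, as do positions 2-4, so patches from different images never mix.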

nemo_automodel.components.models.kimi_k25_vl.model.FLASH_ATTN_AVAILABLE = True
nemo_automodel.components.models.kimi_k25_vl.model.LOGGER = logging.getLogger(__name__)
nemo_automodel.components.models.kimi_k25_vl.model.ModelClass = KimiK25VLForConditionalGeneration