nemo_automodel.components.models.kimivl.model

Module Contents

Classes

| Name | Description |
| --- | --- |
| DeepSeekV3RotaryEmbeddingAdapter | Callable adapter that wraps DeepseekV3’s freqs_cis-based RoPE. |
| KimiVLConfig | Configuration for KimiVL model. |
| KimiVLForConditionalGeneration | KimiVL model with backend-aware DeepseekV3 language model. |
| KimiVLLanguageModelBackend | Backend-aware language model wrapper using DeepseekV3 architecture. |
| KimiVLModel | KimiVL multimodal backbone with a DeepseekV3 text decoder. |
| KimiVLMultiModalProjector | Projects vision features to language model dimension. |
| KimiVLStateDictAdapter | State dict adapter for KimiVL checkpoints. |
| Learnable2DInterpPosEmb | Learnable 2D interpolatable position embedding. |
| MoonViTConfig | Configuration for MoonVit vision encoder. |
| MoonVisionPatchEmbed | Patch embedding for MoonVit. |
| MoonVitEncoder | MoonVit encoder. |
| MoonVitEncoderLayer | Single encoder layer for MoonVit. |
| MoonVitMLP | MLP for MoonVit. |
| MoonVitPretrainedModel | MoonVit vision encoder. |
| Rope2DPosEmb | - |

Functions

| Name | Description |
| --- | --- |
| _apply_rope_vision | Apply rotary position embedding for vision. |
| _register_kimi_vl_with_transformers | Register KimiVLConfig and model with transformers Auto classes. |
| patch_merger | Merge patches. |
| vision_attention_flash | Flash attention for vision. |
| vision_attention_sdpa | SDPA attention for vision. |

Data

FLASH_ATTN_AVAILABLE

LOGGER

ModelClass

API

class nemo_automodel.components.models.kimivl.model.DeepSeekV3RotaryEmbeddingAdapter(
parent_module: torch.nn.Module,
rope_fusion: bool = False
)

Callable adapter that wraps DeepseekV3’s freqs_cis-based RoPE.

This class is intentionally not an nn.Module, so that it is not pruned when the model is split for pipeline parallelism (PP). It holds a reference to the parent module’s freqs_cis buffer and computes position embeddings on demand.

The parent module (KimiVLLanguageModelBackend) owns the freqs_cis buffer, and this adapter accesses it via the reference.

freqs_cis

Access freqs_cis from the parent module.

nemo_automodel.components.models.kimivl.model.DeepSeekV3RotaryEmbeddingAdapter.__call__(
hidden_states: torch.Tensor,
position_ids: torch.Tensor
) -> torch.Tensor

Compute position embeddings from pre-computed freqs_cis.

Parameters:

hidden_states (torch.Tensor): Input tensor, used only for device/dtype inference.

position_ids (torch.Tensor): Position indices tensor.

Returns: torch.Tensor

Position embeddings tensor compatible with DeepseekV3 Block layers.
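
The reference-holding pattern described for this adapter can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `ParentStub`, `RopeAdapterSketch`, and `precompute_freqs_cis` are hypothetical names, the frequency schedule is the common RoPE one, and nested lists of complex numbers stand in for the tensor buffer.

```python
import cmath


def precompute_freqs_cis(max_len: int, dim: int, theta: float = 10000.0):
    """Hypothetical helper: one row of unit complex rotations per position."""
    inv_freq = [theta ** (-2 * i / dim) for i in range(dim // 2)]
    return [[cmath.rect(1.0, pos * f) for f in inv_freq] for pos in range(max_len)]


class ParentStub:
    """Stands in for the module that owns the freqs_cis buffer."""

    def __init__(self, max_len: int, dim: int):
        self.freqs_cis = precompute_freqs_cis(max_len, dim)


class RopeAdapterSketch:
    """Plain callable (not a module subclass) that reads the parent's buffer lazily."""

    def __init__(self, parent):
        self._parent = parent  # keep a reference, never a copy of the buffer

    @property
    def freqs_cis(self):
        # Looked up on every access, so a re-created buffer is picked up.
        return self._parent.freqs_cis

    def __call__(self, hidden_states, position_ids):
        # hidden_states is unused in this sketch; per the docs above, the
        # documented adapter uses it only for device/dtype inference.
        return [[self.freqs_cis[p] for p in row] for row in position_ids]


parent = ParentStub(max_len=8, dim=4)
adapter = RopeAdapterSketch(parent)
emb = adapter(hidden_states=None, position_ids=[[0, 1, 2]])
```

Because the adapter stores only a reference to its parent, replacing `parent.freqs_cis` later (for example after re-initializing buffers) is visible through `adapter.freqs_cis` without rebuilding the adapter.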

class nemo_automodel.components.models.kimivl.model.KimiVLConfig(
vision_config: typing.Optional[typing.Union[typing.Dict, nemo_automodel.components.models.kimivl.model.MoonViTConfig]] = None,
text_config: typing.Optional[typing.Union[typing.Dict, transformers.models.deepseek_v3.configuration_deepseek_v3.DeepseekV3Config]] = None,
ignore_index: int = -100,
media_placeholder_token_id: int = 163605,
pad_token_id: int = 0,
architectures: typing.Optional[typing.List[str]] = None,
kwargs = {}
)

Bases: PretrainedConfig

Configuration for KimiVL model.

model_type = 'kimi_vl'

nemo_automodel.components.models.kimivl.model.KimiVLConfig.to_dict() -> typing.Dict[str, typing.Any]

class nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs = {}
)

Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin

KimiVL model with backend-aware DeepseekV3 language model.

_pp_keep_self_forward: bool = True
backend = backend or BackendConfig()
lm_head

Convenience property to access lm_head from top level.

media_placeholder_token_id = config.media_placeholder_token_id
model
moe_config = self.model.moe_config
pad_token_id
state_dict_adapter
vocab_size = config.text_config.vocab_size

nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.forward(
input_ids = None,
attention_mask = None,
position_ids = None,
past_key_values = None,
inputs_embeds = None,
labels = None,
use_cache = None,
output_attentions = None,
output_hidden_states: typing.Optional[bool] = None,
return_dict = None,
pixel_values = None,
image_grid_hws = None,
padding_mask = None,
logits_to_keep: typing.Union[int, torch.Tensor] = 0,
kwargs = {}
)
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.from_config(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs = {}
)
classmethod
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.from_pretrained(
pretrained_model_name_or_path: str,
model_args = (),
kwargs = {}
)
classmethod
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.get_input_embeddings()
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.get_output_embeddings()
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.initialize_weights(
buffer_device = None,
dtype = torch.bfloat16
)
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.set_input_embeddings(
value
)
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.set_output_embeddings(
new_embeddings
)
class nemo_automodel.components.models.kimivl.model.KimiVLLanguageModelBackend(
config,
backend: nemo_automodel.components.models.common.BackendConfig,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None
)

Bases: Module

Backend-aware language model wrapper using DeepseekV3 architecture.

Note: lm_head is not included here; it lives at the top level of KimiVLForConditionalGeneration to match the Hugging Face (HF) checkpoint structure.

model
moe_config = self.model.moe_config
rotary_emb

nemo_automodel.components.models.kimivl.model.KimiVLLanguageModelBackend.forward(
input_ids = None,
inputs_embeds = None,
attention_mask = None,
position_ids = None,
padding_mask = None,
kwargs = {}
)
nemo_automodel.components.models.kimivl.model.KimiVLLanguageModelBackend.get_input_embeddings()
nemo_automodel.components.models.kimivl.model.KimiVLLanguageModelBackend.init_weights(
buffer_device = None
)
nemo_automodel.components.models.kimivl.model.KimiVLLanguageModelBackend.set_input_embeddings(
value
)
class nemo_automodel.components.models.kimivl.model.KimiVLModel(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None
)

Bases: Module

KimiVL multimodal backbone with a DeepseekV3 text decoder.

backend = backend or BackendConfig()
language_model
media_placeholder_token_id = config.media_placeholder_token_id
moe_config = self.language_model.moe_config
multi_modal_projector = KimiVLMultiModalProjector(config)
vision_tower = MoonVitPretrainedModel(config.vision_config)

nemo_automodel.components.models.kimivl.model.KimiVLModel._extract_image_features(
pixel_values,
image_grid_hws
)

Extract and project image features.

nemo_automodel.components.models.kimivl.model.KimiVLModel._merge_with_image_features(
inputs_embeds,
input_ids,
image_features
)

Merge image features into input embeddings.
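
A common way to merge image features into a token sequence is to overwrite the embedding at each media placeholder position with the next image feature row. The sketch below illustrates that idea for a single unbatched sequence using plain lists. It assumes one feature row per placeholder token and order-preserving replacement; the function name and the length check are illustrative choices, not confirmed details of `_merge_with_image_features`. The placeholder ID is the `media_placeholder_token_id` default from `KimiVLConfig`.

```python
MEDIA_PLACEHOLDER_TOKEN_ID = 163605  # default from KimiVLConfig


def merge_with_image_features(inputs_embeds, input_ids, image_features,
                              placeholder_id=MEDIA_PLACEHOLDER_TOKEN_ID):
    """Replace the embedding at every placeholder position with the next image feature row."""
    n_slots = sum(tok == placeholder_id for tok in input_ids)
    if n_slots != len(image_features):
        raise ValueError(
            f"{n_slots} placeholder tokens but {len(image_features)} image feature rows"
        )
    features = iter(image_features)
    merged = []
    for tok, emb in zip(input_ids, inputs_embeds):
        # Text positions keep their embedding; placeholder positions take an image row.
        merged.append(next(features) if tok == placeholder_id else emb)
    return merged


input_ids = [11, 163605, 163605, 42]
inputs_embeds = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.4, 0.4]]
image_features = [[9.0, 9.0], [8.0, 8.0]]
merged = merge_with_image_features(inputs_embeds, input_ids, image_features)
print(merged)  # [[0.1, 0.1], [9.0, 9.0], [8.0, 8.0], [0.4, 0.4]]
```

Checking the placeholder count up front turns a silent misalignment between the tokenized prompt and the vision output into an immediate error.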

nemo_automodel.components.models.kimivl.model.KimiVLModel.forward(
input_ids = None,
attention_mask = None,
position_ids = None,
inputs_embeds = None,
pixel_values = None,
image_grid_hws = None,
padding_mask = None,
kwargs = {}
)
class nemo_automodel.components.models.kimivl.model.KimiVLMultiModalProjector(
config
)

Bases: Module

Projects vision features to language model dimension.

act = GELUActivation()
hidden_size
linear_1
linear_2
pre_norm = nn.LayerNorm(vision_config.hidden_size, eps=1e-05)

nemo_automodel.components.models.kimivl.model.KimiVLMultiModalProjector.forward(
image_features: typing.List[torch.Tensor]
) -> torch.Tensor
class nemo_automodel.components.models.kimivl.model.KimiVLStateDictAdapter(
config,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
dtype: torch.dtype = torch.float32
)

State dict adapter for KimiVL checkpoints.

_last_expected_hf_keys: set[str] | None = None
llm_adapter

nemo_automodel.components.models.kimivl.model.KimiVLStateDictAdapter.from_hf(
state_dict: dict,
kwargs = {}
) -> dict
nemo_automodel.components.models.kimivl.model.KimiVLStateDictAdapter.to_hf(
state_dict: dict,
kwargs = {}
) -> dict
class nemo_automodel.components.models.kimivl.model.Learnable2DInterpPosEmb(
height: int,
width: int,
dim: int,
interpolation_mode: str = 'bicubic'
)

Bases: Module

Learnable 2D interpolatable position embedding.

weight = nn.Parameter(torch.empty(height, width, dim))

nemo_automodel.components.models.kimivl.model.Learnable2DInterpPosEmb.forward(
x: torch.Tensor,
grid_hws: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.kimivl.model.MoonViTConfig(
patch_size: int = 14,
init_pos_emb_height: int = 64,
init_pos_emb_width: int = 64,
num_attention_heads: int = 16,
num_hidden_layers: int = 27,
hidden_size: int = 1152,
intermediate_size: int = 4304,
merge_kernel_size: typing.Tuple[int, int] = (2, 2),
kwargs = {}
)

Bases: PretrainedConfig

Configuration for MoonVit vision encoder.

merge_kernel_size
model_type = 'moonvit'

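
The defaults above imply a simple relationship between image size and vision token count. The following arithmetic sketch assumes that patches are non-overlapping `patch_size` x `patch_size` squares and that `merge_kernel_size` groups non-overlapping windows of patches into one token each. `vision_token_counts` is a hypothetical helper, not part of the module.

```python
def vision_token_counts(height_px, width_px, patch_size=14, merge_kernel_size=(2, 2)):
    """Return (grid_h, grid_w, patches, merged_tokens) for one image."""
    if height_px % patch_size or width_px % patch_size:
        raise ValueError("image size must be a multiple of patch_size in this sketch")
    grid_h, grid_w = height_px // patch_size, width_px // patch_size
    kh, kw = merge_kernel_size
    if grid_h % kh or grid_w % kw:
        raise ValueError("patch grid must be divisible by merge_kernel_size in this sketch")
    patches = grid_h * grid_w
    merged = (grid_h // kh) * (grid_w // kw)
    return grid_h, grid_w, patches, merged


print(vision_token_counts(448, 448))  # (32, 32, 1024, 256)
```

Under these assumptions a 448 x 448 image yields a 32 x 32 patch grid, and the 2 x 2 merge reduces 1024 patches to 256 tokens.
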
class nemo_automodel.components.models.kimivl.model.MoonVisionPatchEmbed(
out_dim: int,
in_dim: int = 3,
patch_size: int = 14,
pos_emb_height: int = 64,
pos_emb_width: int = 64
)

Bases: Module

Patch embedding for MoonVit.

patch_size
pos_emb
proj
nemo_automodel.components.models.kimivl.model.MoonVisionPatchEmbed.forward(
x: torch.Tensor,
grid_hws: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.kimivl.model.MoonVitEncoder(
hidden_dim: int,
num_layers: int,
block_cfg: dict
)

Bases: Module

MoonVit encoder.

blocks
final_layernorm = nn.LayerNorm(hidden_dim)
rope_2d

nemo_automodel.components.models.kimivl.model.MoonVitEncoder.forward(
hidden_states: torch.Tensor,
grid_hws: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.kimivl.model.MoonVitEncoderLayer(
num_heads: int,
hidden_dim: int,
mlp_dim: int,
activation = F.gelu,
attn_bias: bool = False,
attn_implementation: str = 'flash_attention_2'
)

Bases: Module

Single encoder layer for MoonVit.

head_dim = hidden_dim // num_heads
mlp
norm0 = nn.LayerNorm(hidden_dim)
norm1 = nn.LayerNorm(hidden_dim)
wo = nn.Linear(hidden_dim, hidden_dim, bias=attn_bias)
wqkv

nemo_automodel.components.models.kimivl.model.MoonVitEncoderLayer.forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.Tensor,
rope_freqs_cis: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.kimivl.model.MoonVitMLP(
dims: typing.List[int],
activation,
bias: bool = True
)

Bases: Module

MLP for MoonVit.

fc0 = nn.Linear(dims[0], dims[1], bias=bias)
fc1 = nn.Linear(dims[1], dims[2], bias=bias)

nemo_automodel.components.models.kimivl.model.MoonVitMLP.forward(
x: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.kimivl.model.MoonVitPretrainedModel(
config
)

Bases: Module

MoonVit vision encoder.

encoder
merge_kernel_size = config.merge_kernel_size
patch_embed

nemo_automodel.components.models.kimivl.model.MoonVitPretrainedModel.forward(
pixel_values: torch.Tensor,
grid_hws: torch.Tensor
) -> typing.List[torch.Tensor]
class nemo_automodel.components.models.kimivl.model.Rope2DPosEmb(
dim: int,
max_height: int,
max_width: int,
theta_base: float = 10000
)

Bases: Module

nemo_automodel.components.models.kimivl.model.Rope2DPosEmb._precompute_freqs_cis(
device: torch.device
) -> torch.Tensor
nemo_automodel.components.models.kimivl.model.Rope2DPosEmb.get_freqs_cis(
grid_hws: torch.Tensor
) -> torch.Tensor
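
Rope2DPosEmb has no description in this reference, so the following is only a sketch of how a 2D rotary table is commonly built: half of the complex rotations are driven by the column index and half by the row index. The frequency schedule, the interleaved x/y layout, and the row-major gathering in `get_freqs_cis` are assumptions, and nested lists of complex numbers stand in for tensors.

```python
import cmath


def precompute_freqs_cis_2d(dim, max_height, max_width, theta_base=10000.0):
    """Return a [max_height][max_width][dim // 2] grid of unit complex rotations."""
    if dim % 4:
        raise ValueError("dim must be divisible by 4 (half the pairs for x, half for y)")
    inv_freq = [theta_base ** (-i / dim) for i in range(0, dim, 4)]
    table = []
    for y in range(max_height):
        row = []
        for x in range(max_width):
            cell = []
            for f in inv_freq:
                cell.append(cmath.rect(1.0, x * f))  # rotation driven by the column index
                cell.append(cmath.rect(1.0, y * f))  # rotation driven by the row index
            row.append(cell)
        table.append(row)
    return table


def get_freqs_cis(table, grid_hws):
    """Concatenate the top-left (h, w) window of the table for every image, row-major."""
    out = []
    for h, w in grid_hws:
        for y in range(h):
            out.extend(table[y][:w])
    return out


table = precompute_freqs_cis_2d(dim=8, max_height=4, max_width=4)
freqs = get_freqs_cis(table, grid_hws=[(2, 3), (1, 2)])  # 6 + 2 = 8 token positions
```

Precomputing the table once for the maximum grid and slicing it per image avoids recomputing trigonometric terms for every batch.
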
nemo_automodel.components.models.kimivl.model._apply_rope_vision(
xq: torch.Tensor,
xk: torch.Tensor,
freqs_cis: torch.Tensor
) -> typing.Tuple[torch.Tensor, torch.Tensor]

Apply rotary position embedding for vision.
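
Rotary position embedding is typically applied by viewing consecutive feature pairs as complex numbers and multiplying them by unit-modulus factors. The sketch below shows that operation for a single token vector using Python's built-in complex type. The adjacent-pair layout and the helper name `apply_rope` are assumptions; the library function operates on query and key tensors.

```python
import cmath


def apply_rope(vec, freqs_cis):
    """Rotate consecutive (even, odd) pairs of `vec` by the matching complex factor."""
    if len(vec) != 2 * len(freqs_cis):
        raise ValueError("need one complex factor per pair of features")
    out = []
    for i, rot in enumerate(freqs_cis):
        z = complex(vec[2 * i], vec[2 * i + 1]) * rot
        out.extend([z.real, z.imag])
    return out


q = [1.0, 0.0, 0.0, 2.0]
k = [0.5, 0.5, 1.0, -1.0]
# First pair is rotated by 90 degrees, second pair is left unchanged.
freqs_cis = [cmath.rect(1.0, cmath.pi / 2), cmath.rect(1.0, 0.0)]
q_rot, k_rot = apply_rope(q, freqs_cis), apply_rope(k, freqs_cis)
```

Each factor has modulus 1, so the rotation changes the direction of every feature pair but preserves the vector norm.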

nemo_automodel.components.models.kimivl.model._register_kimi_vl_with_transformers()

Register KimiVLConfig and model with transformers Auto classes.

Registration goes through the official transformers Auto class API. Once registered, AutoModelForImageTextToText.from_pretrained resolves to the local implementation directly and bypasses the trust_remote_code mechanism entirely.

nemo_automodel.components.models.kimivl.model.patch_merger(
x: torch.Tensor,
grid_hws: torch.Tensor,
merge_kernel_size: typing.List[int] | None = None
) -> typing.List[torch.Tensor]

Merge patches.
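
One plausible reading of patch merging, given the `grid_hws` and `merge_kernel_size` arguments and the list return type, is that each image's row-major patch tokens are grouped into non-overlapping windows. The sketch below illustrates that grouping with integers standing in for patch feature vectors. The layout, the window ordering, and the name `patch_merger_sketch` are assumptions, not confirmed behavior of `patch_merger`.

```python
def patch_merger_sketch(tokens, grid_hws, merge_kernel_size=(2, 2)):
    """Group each image's row-major patch tokens into kh x kw windows.

    Returns one list per image; each entry holds the kh * kw tokens of one window.
    """
    kh, kw = merge_kernel_size
    outputs, start = [], 0
    for h, w in grid_hws:
        seq = tokens[start:start + h * w]  # this image's slice of the packed sequence
        windows = []
        for wy in range(h // kh):
            for wx in range(w // kw):
                windows.append([
                    seq[(wy * kh + dy) * w + (wx * kw + dx)]
                    for dy in range(kh)
                    for dx in range(kw)
                ])
        outputs.append(windows)
        start += h * w
    return outputs


# Integers stand in for patch feature vectors: one 2x4 image, then one 2x2 image.
merged = patch_merger_sketch(list(range(12)), grid_hws=[(2, 4), (2, 2)])
print(merged)  # [[[0, 1, 4, 5], [2, 3, 6, 7]], [[8, 9, 10, 11]]]
```

Returning one entry per image keeps images with different grid sizes separate, which matches the list-of-tensors return type in the signature.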

nemo_automodel.components.models.kimivl.model.vision_attention_flash(
q,
k,
v,
q_cu_seqlens,
k_cu_seqlens
)

Flash attention for vision.

nemo_automodel.components.models.kimivl.model.vision_attention_sdpa(
q,
k,
v,
q_cu_seqlens,
k_cu_seqlens
)

SDPA attention for vision.
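
Both attention functions take `q_cu_seqlens` and `k_cu_seqlens`, which suggests the packed variable-length convention in which several images share one token sequence. The sketch below assumes that these are cumulative token counts per image and that a mask-based path restricts attention to a block-diagonal pattern; both helper names are hypothetical.

```python
from itertools import accumulate


def cu_seqlens_from_grid(grid_hws):
    """Cumulative token offsets: image i owns tokens cu[i] .. cu[i + 1] - 1."""
    return [0, *accumulate(h * w for h, w in grid_hws)]


def block_diagonal_mask(cu_seqlens):
    """Boolean mask where True means 'query row may attend to key column'."""
    total = cu_seqlens[-1]
    mask = [[False] * total for _ in range(total)]
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask


cu = cu_seqlens_from_grid([(1, 2), (1, 3)])
mask = block_diagonal_mask(cu)
print(cu)  # [0, 2, 5]
```

A variable-length flash attention kernel can consume the offsets directly, while a dense attention call needs the equivalent mask so that tokens from different images never attend to each other.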

nemo_automodel.components.models.kimivl.model.FLASH_ATTN_AVAILABLE = True
nemo_automodel.components.models.kimivl.model.LOGGER = logging.getLogger(__name__)
nemo_automodel.components.models.kimivl.model.ModelClass = KimiVLForConditionalGeneration