# nemo_automodel.components.models.kimivl.model

## Module Contents

### Classes

| Name                                                                                                                  | Description                                                         |
| --------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| [`DeepSeekV3RotaryEmbeddingAdapter`](#nemo_automodel-components-models-kimivl-model-DeepSeekV3RotaryEmbeddingAdapter) | Callable adapter that wraps DeepseekV3's freqs\_cis-based RoPE.     |
| [`KimiVLConfig`](#nemo_automodel-components-models-kimivl-model-KimiVLConfig)                                         | Configuration for KimiVL model.                                     |
| [`KimiVLForConditionalGeneration`](#nemo_automodel-components-models-kimivl-model-KimiVLForConditionalGeneration)     | KimiVL model with backend-aware DeepseekV3 language model.          |
| [`KimiVLLanguageModelBackend`](#nemo_automodel-components-models-kimivl-model-KimiVLLanguageModelBackend)             | Backend-aware language model wrapper using DeepseekV3 architecture. |
| [`KimiVLModel`](#nemo_automodel-components-models-kimivl-model-KimiVLModel)                                           | KimiVL multimodal backbone with a DeepseekV3 text decoder.          |
| [`KimiVLMultiModalProjector`](#nemo_automodel-components-models-kimivl-model-KimiVLMultiModalProjector)               | Projects vision features to language model dimension.               |
| [`KimiVLStateDictAdapter`](#nemo_automodel-components-models-kimivl-model-KimiVLStateDictAdapter)                     | State dict adapter for KimiVL checkpoints.                          |
| [`Learnable2DInterpPosEmb`](#nemo_automodel-components-models-kimivl-model-Learnable2DInterpPosEmb)                   | Learnable 2D interpolatable position embedding.                     |
| [`MoonViTConfig`](#nemo_automodel-components-models-kimivl-model-MoonViTConfig)                                       | Configuration for MoonVit vision encoder.                           |
| [`MoonVisionPatchEmbed`](#nemo_automodel-components-models-kimivl-model-MoonVisionPatchEmbed)                         | Patch embedding for MoonVit.                                        |
| [`MoonVitEncoder`](#nemo_automodel-components-models-kimivl-model-MoonVitEncoder)                                     | MoonVit encoder.                                                    |
| [`MoonVitEncoderLayer`](#nemo_automodel-components-models-kimivl-model-MoonVitEncoderLayer)                           | Single encoder layer for MoonVit.                                   |
| [`MoonVitMLP`](#nemo_automodel-components-models-kimivl-model-MoonVitMLP)                                             | MLP for MoonVit.                                                    |
| [`MoonVitPretrainedModel`](#nemo_automodel-components-models-kimivl-model-MoonVitPretrainedModel)                     | MoonVit vision encoder.                                             |
| [`Rope2DPosEmb`](#nemo_automodel-components-models-kimivl-model-Rope2DPosEmb)                                         | 2D rotary position embedding (RoPE) for the MoonVit vision encoder. |

### Functions

| Name                                                                                                                        | Description                                                     |
| --------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- |
| [`_apply_rope_vision`](#nemo_automodel-components-models-kimivl-model-_apply_rope_vision)                                   | Apply rotary position embedding for vision.                     |
| [`_register_kimi_vl_with_transformers`](#nemo_automodel-components-models-kimivl-model-_register_kimi_vl_with_transformers) | Register KimiVLConfig and model with transformers Auto classes. |
| [`patch_merger`](#nemo_automodel-components-models-kimivl-model-patch_merger)                                               | Merge patches.                                                  |
| [`vision_attention_flash`](#nemo_automodel-components-models-kimivl-model-vision_attention_flash)                           | Flash attention for vision.                                     |
| [`vision_attention_sdpa`](#nemo_automodel-components-models-kimivl-model-vision_attention_sdpa)                             | SDPA attention for vision.                                      |

### Data

[`FLASH_ATTN_AVAILABLE`](#nemo_automodel-components-models-kimivl-model-FLASH_ATTN_AVAILABLE)

[`LOGGER`](#nemo_automodel-components-models-kimivl-model-LOGGER)

[`ModelClass`](#nemo_automodel-components-models-kimivl-model-ModelClass)

### API

```python
class nemo_automodel.components.models.kimivl.model.DeepSeekV3RotaryEmbeddingAdapter(
    parent_module: torch.nn.Module,
    rope_fusion: bool = False
)
```

Callable adapter that wraps DeepseekV3's freqs\_cis-based RoPE.

The adapter is deliberately not an `nn.Module`, so it is not pruned when the
model is split for pipeline parallelism (PP). It holds a reference to the
parent module's freqs\_cis buffer and computes position embeddings on demand.

The parent module (KimiVLLanguageModelBackend) owns the freqs\_cis buffer,
and this adapter accesses it via the reference.

A property on the adapter reads freqs\_cis from the parent module.

```python
nemo_automodel.components.models.kimivl.model.DeepSeekV3RotaryEmbeddingAdapter.__call__(
    hidden_states: torch.Tensor,
    position_ids: torch.Tensor
) -> torch.Tensor
```

Compute position embeddings from pre-computed freqs\_cis.

**Parameters:**

`hidden_states`: Input tensor (used only for device/dtype inference)

`position_ids`: Position indices tensor

**Returns:** `torch.Tensor`

Position embeddings tensor compatible with DeepseekV3 Block layers
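
The pattern described above (a plain callable that borrows a buffer from its parent instead of registering as a submodule) can be sketched as follows. This is a minimal illustration, not the library's implementation: `FreqsAdapterSketch`, `ParentSketch`, the buffer contents, and the simple index-based lookup are all assumptions made for the example.

```python
import torch
from torch import nn


class FreqsAdapterSketch:
    """Hypothetical plain callable (not an nn.Module) that reads a parent-owned buffer."""

    def __init__(self, parent_module: nn.Module):
        # Stored as a normal Python attribute, so nothing is registered as a submodule.
        self._parent = parent_module

    @property
    def freqs_cis(self) -> torch.Tensor:
        # Always read through the parent so the adapter sees the current buffer.
        return self._parent.freqs_cis

    def __call__(self, hidden_states: torch.Tensor, position_ids: torch.Tensor) -> torch.Tensor:
        # hidden_states is only used to pick the target device.
        return self.freqs_cis.to(hidden_states.device)[position_ids]


class ParentSketch(nn.Module):
    """Hypothetical parent that owns the freqs_cis buffer."""

    def __init__(self, max_len: int = 8, half_dim: int = 4):
        super().__init__()
        angles = torch.outer(
            torch.arange(max_len, dtype=torch.float32),
            torch.arange(half_dim, dtype=torch.float32),
        )
        self.register_buffer("freqs_cis", torch.polar(torch.ones_like(angles), angles), persistent=False)
        self.rotary_emb = FreqsAdapterSketch(self)  # plain attribute, not a child module


parent = ParentSketch()
hidden = torch.zeros(1, 3, 16)
position_ids = torch.tensor([[0, 2, 5]])
pos_emb = parent.rotary_emb(hidden, position_ids)  # shape (1, 3, 4), complex
```

Because the adapter is not a child module, it does not appear in `parent.named_children()`, which is the property the description relies on for pipeline-parallel splitting.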

```python
class nemo_automodel.components.models.kimivl.model.KimiVLConfig(
    vision_config: typing.Optional[typing.Union[typing.Dict, nemo_automodel.components.models.kimivl.model.MoonViTConfig]] = None,
    text_config: typing.Optional[typing.Union[typing.Dict, transformers.models.deepseek_v3.configuration_deepseek_v3.DeepseekV3Config]] = None,
    ignore_index: int = -100,
    media_placeholder_token_id: int = 163605,
    pad_token_id: int = 0,
    architectures: typing.Optional[typing.List[str]] = None,
    **kwargs
)
```

**Bases:** `PretrainedConfig`

Configuration for KimiVL model.

```python
nemo_automodel.components.models.kimivl.model.KimiVLConfig.to_dict() -> typing.Dict[str, typing.Any]
```

```python
class nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration(
    config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

**Bases:** [HFCheckpointingMixin](/nemo-automodel/nemo_automodel/components/models/common/hf_checkpointing_mixin#nemo_automodel-components-models-common-hf_checkpointing_mixin-HFCheckpointingMixin), `Module`, [MoEFSDPSyncMixin](/nemo-automodel/nemo_automodel/components/moe/fsdp_mixin#nemo_automodel-components-moe-fsdp_mixin-MoEFSDPSyncMixin)

KimiVL model with backend-aware DeepseekV3 language model.

Convenience property to access lm\_head from top level.

```python
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.forward(
    input_ids = None,
    attention_mask = None,
    position_ids = None,
    past_key_values = None,
    inputs_embeds = None,
    labels = None,
    use_cache = None,
    output_attentions = None,
    output_hidden_states: typing.Optional[bool] = None,
    return_dict = None,
    pixel_values = None,
    image_grid_hws = None,
    padding_mask = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    **kwargs
)
```

```python
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.from_config(
    config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

classmethod

```python
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args,
    **kwargs
)
```

classmethod

```python
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.get_input_embeddings()
```

```python
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.get_output_embeddings()
```

```python
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.initialize_weights(
    buffer_device = None,
    dtype = torch.bfloat16
)
```

```python
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.set_input_embeddings(
    value
)
```

```python
nemo_automodel.components.models.kimivl.model.KimiVLForConditionalGeneration.set_output_embeddings(
    new_embeddings
)
```

```python
class nemo_automodel.components.models.kimivl.model.KimiVLLanguageModelBackend(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None
)
```

**Bases:** `Module`

Backend-aware language model wrapper using DeepseekV3 architecture.

Note: lm\_head is NOT included here - it's at the top level of
KimiVLForConditionalGeneration to match HF checkpoint structure.

```python
nemo_automodel.components.models.kimivl.model.KimiVLLanguageModelBackend.forward(
    input_ids = None,
    inputs_embeds = None,
    attention_mask = None,
    position_ids = None,
    padding_mask = None,
    **kwargs
)
```

```python
nemo_automodel.components.models.kimivl.model.KimiVLLanguageModelBackend.get_input_embeddings()
```

```python
nemo_automodel.components.models.kimivl.model.KimiVLLanguageModelBackend.init_weights(
    buffer_device = None
)
```

```python
nemo_automodel.components.models.kimivl.model.KimiVLLanguageModelBackend.set_input_embeddings(
    value
)
```

```python
class nemo_automodel.components.models.kimivl.model.KimiVLModel(
    config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None
)
```

**Bases:** `Module`

KimiVL multimodal backbone with a DeepseekV3 text decoder.

```python
nemo_automodel.components.models.kimivl.model.KimiVLModel._extract_image_features(
    pixel_values,
    image_grid_hws
)
```

Extract and project image features.

```python
nemo_automodel.components.models.kimivl.model.KimiVLModel._merge_with_image_features(
    inputs_embeds,
    input_ids,
    image_features
)
```

Merge image features into input embeddings.
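
One common way to perform this kind of merge is to overwrite the embeddings at placeholder-token positions with the projected image features. The sketch below assumes that approach and reuses the default `media_placeholder_token_id` from `KimiVLConfig`; the function name and shapes are illustrative assumptions, not the library's code.

```python
import torch

MEDIA_PLACEHOLDER_TOKEN_ID = 163605  # default value from KimiVLConfig


def merge_with_image_features_sketch(inputs_embeds, input_ids, image_features):
    """Hypothetical merge: write image feature rows over placeholder positions, in order.

    inputs_embeds: (batch, seq, hidden), input_ids: (batch, seq),
    image_features: (num_placeholder_tokens, hidden).
    """
    mask = input_ids == MEDIA_PLACEHOLDER_TOKEN_ID  # (batch, seq) boolean
    if int(mask.sum()) != image_features.shape[0]:
        raise ValueError("placeholder count does not match number of image feature rows")
    merged = inputs_embeds.clone()  # leave the caller's tensor untouched
    merged[mask] = image_features.to(merged.dtype)
    return merged


input_ids = torch.tensor([[1, 163605, 163605, 2]])
inputs_embeds = torch.zeros(1, 4, 3)
image_features = torch.tensor([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
merged = merge_with_image_features_sketch(inputs_embeds, input_ids, image_features)
```

Text positions keep their original embeddings; only the two placeholder positions receive image rows.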

```python
nemo_automodel.components.models.kimivl.model.KimiVLModel.forward(
    input_ids = None,
    attention_mask = None,
    position_ids = None,
    inputs_embeds = None,
    pixel_values = None,
    image_grid_hws = None,
    padding_mask = None,
    **kwargs
)
```

```python
class nemo_automodel.components.models.kimivl.model.KimiVLMultiModalProjector(
    config
)
```

**Bases:** `Module`

Projects vision features to language model dimension.

```python
nemo_automodel.components.models.kimivl.model.KimiVLMultiModalProjector.forward(
    image_features: typing.List[torch.Tensor]
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.kimivl.model.KimiVLStateDictAdapter(
    config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    dtype: torch.dtype = torch.float32
)
```

State dict adapter for KimiVL checkpoints.

```python
nemo_automodel.components.models.kimivl.model.KimiVLStateDictAdapter.from_hf(
    state_dict: dict,
    **kwargs
) -> dict
```

```python
nemo_automodel.components.models.kimivl.model.KimiVLStateDictAdapter.to_hf(
    state_dict: dict,
    **kwargs
) -> dict
```

```python
class nemo_automodel.components.models.kimivl.model.Learnable2DInterpPosEmb(
    height: int,
    width: int,
    dim: int,
    interpolation_mode: str = 'bicubic'
)
```

**Bases:** `Module`

Learnable 2D interpolatable position embedding.

```python
nemo_automodel.components.models.kimivl.model.Learnable2DInterpPosEmb.forward(
    x: torch.Tensor,
    grid_hws: torch.Tensor
) -> torch.Tensor
```
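
To illustrate what "interpolatable" means here, the sketch below resizes a learned `(height, width, dim)` table to each image's patch grid with `torch.nn.functional.interpolate`. It is an assumption about how such an embedding can work, not the class's actual `forward`; the function name and the choice to return only the position rows are hypothetical.

```python
import torch
import torch.nn.functional as F


def interpolate_pos_emb_sketch(weight, grid_hws, mode="bicubic"):
    """weight: (H, W, dim) learned table; grid_hws: (num_images, 2) patch-grid sizes."""
    init_hw = tuple(weight.shape[:2])
    pieces = []
    for h, w in grid_hws.tolist():
        if (h, w) == init_hw:
            emb = weight  # grid matches the table, no resampling needed
        else:
            emb = F.interpolate(
                weight.permute(2, 0, 1).unsqueeze(0),  # (1, dim, H, W)
                size=(h, w),
                mode=mode,
            ).squeeze(0).permute(1, 2, 0)  # back to (h, w, dim)
        pieces.append(emb.reshape(h * w, -1))
    # One row per patch, images concatenated in order.
    return torch.cat(pieces, dim=0)


weight = torch.randn(4, 4, 8)
grid_hws = torch.tensor([[4, 4], [2, 6]])
pos = interpolate_pos_emb_sketch(weight, grid_hws)  # (16 + 12, 8)
```

In a real module the resulting rows would typically be added to the patch embeddings `x`.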

```python
class nemo_automodel.components.models.kimivl.model.MoonViTConfig(
    patch_size: int = 14,
    init_pos_emb_height: int = 64,
    init_pos_emb_width: int = 64,
    num_attention_heads: int = 16,
    num_hidden_layers: int = 27,
    hidden_size: int = 1152,
    intermediate_size: int = 4304,
    merge_kernel_size: typing.Tuple[int, int] = (2, 2),
    **kwargs
)
```

**Bases:** `PretrainedConfig`

Configuration for MoonVit vision encoder.

```python
class nemo_automodel.components.models.kimivl.model.MoonVisionPatchEmbed(
    out_dim: int,
    in_dim: int = 3,
    patch_size: int = 14,
    pos_emb_height: int = 64,
    pos_emb_width: int = 64
)
```

**Bases:** `Module`

Patch embedding for MoonVit.

```python
nemo_automodel.components.models.kimivl.model.MoonVisionPatchEmbed.forward(
    x: torch.Tensor,
    grid_hws: torch.Tensor
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.kimivl.model.MoonVitEncoder(
    hidden_dim: int,
    num_layers: int,
    block_cfg: dict
)
```

**Bases:** `Module`

MoonVit encoder.

```python
nemo_automodel.components.models.kimivl.model.MoonVitEncoder.forward(
    hidden_states: torch.Tensor,
    grid_hws: torch.Tensor
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.kimivl.model.MoonVitEncoderLayer(
    num_heads: int,
    hidden_dim: int,
    mlp_dim: int,
    activation = F.gelu,
    attn_bias: bool = False,
    attn_implementation: str = 'flash_attention_2'
)
```

**Bases:** `Module`

Single encoder layer for MoonVit.

```python
nemo_automodel.components.models.kimivl.model.MoonVitEncoderLayer.forward(
    hidden_states: torch.Tensor,
    cu_seqlens: torch.Tensor,
    rope_freqs_cis: torch.Tensor
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.kimivl.model.MoonVitMLP(
    dims: typing.List[int],
    activation,
    bias: bool = True
)
```

**Bases:** `Module`

MLP for MoonVit.

```python
nemo_automodel.components.models.kimivl.model.MoonVitMLP.forward(
    x: torch.Tensor
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.kimivl.model.MoonVitPretrainedModel(
    config
)
```

**Bases:** `Module`

MoonVit vision encoder.

```python
nemo_automodel.components.models.kimivl.model.MoonVitPretrainedModel.forward(
    pixel_values: torch.Tensor,
    grid_hws: torch.Tensor
) -> typing.List[torch.Tensor]
```

```python
class nemo_automodel.components.models.kimivl.model.Rope2DPosEmb(
    dim: int,
    max_height: int,
    max_width: int,
    theta_base: float = 10000
)
```

**Bases:** `Module`

```python
nemo_automodel.components.models.kimivl.model.Rope2DPosEmb._precompute_freqs_cis(
    device: torch.device
) -> torch.Tensor
```

```python
nemo_automodel.components.models.kimivl.model.Rope2DPosEmb.get_freqs_cis(
    grid_hws: torch.Tensor
) -> torch.Tensor
```
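
The two methods above suggest a precomputed table of complex rotations indexed by patch row and column. A minimal sketch of that idea follows, assuming half of the rotary channels encode the column index and half the row index; the function names and the exact channel interleaving are assumptions for illustration.

```python
import torch


def precompute_freqs_cis_2d_sketch(dim, max_height, max_width, theta_base=10000.0):
    """Return a (max_height, max_width, dim // 2) complex table of unit rotations."""
    assert dim % 4 == 0, "dim must be divisible by 4 (half for x, half for y)"
    ys, xs = torch.meshgrid(torch.arange(max_height), torch.arange(max_width), indexing="ij")
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, dim, 4, dtype=torch.float32) / dim))
    x_angles = xs.float().unsqueeze(-1) * inv_freq  # (H, W, dim // 4)
    y_angles = ys.float().unsqueeze(-1) * inv_freq
    x_cis = torch.polar(torch.ones_like(x_angles), x_angles)
    y_cis = torch.polar(torch.ones_like(y_angles), y_angles)
    # Interleave x and y rotations along the last axis.
    return torch.stack([x_cis, y_cis], dim=-1).reshape(max_height, max_width, -1)


def get_freqs_cis_sketch(table, grid_hws):
    """Crop the table to each image's (h, w) grid and flatten to one row per patch."""
    return torch.cat([table[:h, :w].reshape(h * w, -1) for h, w in grid_hws.tolist()], dim=0)


table = precompute_freqs_cis_2d_sketch(dim=8, max_height=4, max_width=6)
freqs = get_freqs_cis_sketch(table, torch.tensor([[2, 3], [4, 2]]))  # (6 + 8, 4)
```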

```python
nemo_automodel.components.models.kimivl.model._apply_rope_vision(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor
) -> typing.Tuple[torch.Tensor, torch.Tensor]
```

Apply rotary position embedding for vision.
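
Rotary embeddings with a complex `freqs_cis` are usually applied by viewing channel pairs as complex numbers and multiplying. The sketch below shows that standard technique under assumed shapes (`(seq, heads, head_dim)` for `xq`/`xk`, `(seq, head_dim // 2)` for `freqs_cis`); the real function's layout may differ.

```python
import torch


def apply_rope_sketch(xq, xk, freqs_cis):
    """Rotate query/key channel pairs by the per-position complex factors."""
    freqs = freqs_cis.unsqueeze(-2)  # (seq, 1, head_dim // 2), broadcast over heads
    xq_c = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_c = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    xq_out = torch.view_as_real(xq_c * freqs).flatten(-2)
    xk_out = torch.view_as_real(xk_c * freqs).flatten(-2)
    return xq_out.type_as(xq), xk_out.type_as(xk)


torch.manual_seed(0)
xq = torch.randn(5, 2, 8)
xk = torch.randn(5, 2, 8)
angles = torch.randn(5, 4)
freqs_cis = torch.polar(torch.ones_like(angles), angles)
q_rot, k_rot = apply_rope_sketch(xq, xk, freqs_cis)
```

Since every factor has unit magnitude, the rotation changes direction but not the norm of each query or key vector.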

```python
nemo_automodel.components.models.kimivl.model._register_kimi_vl_with_transformers()
```

Register KimiVLConfig and model with transformers Auto classes.

This uses the official transformers registration API. When registered,
AutoModelForImageTextToText.from\_pretrained will use our local implementation
directly, bypassing the trust\_remote\_code mechanism entirely.

```python
nemo_automodel.components.models.kimivl.model.patch_merger(
    x: torch.Tensor,
    grid_hws: torch.Tensor,
    merge_kernel_size: typing.List[int] | None = None
) -> typing.List[torch.Tensor]
```

Merge patches.
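
As a rough illustration of what merging with a `(2, 2)` kernel can look like, the sketch below groups each image's patches into non-overlapping `kernel_h x kernel_w` neighbourhoods. It is written from the signature alone, so the output layout `(merged_patches, kernel_h * kernel_w, dim)` is an assumption.

```python
import torch


def patch_merger_sketch(x, grid_hws, merge_kernel_size=(2, 2)):
    """x: (total_patches, dim) for all images; grid_hws: (num_images, 2) patch grids."""
    kh, kw = merge_kernel_size
    dim = x.size(-1)
    outputs = []
    start = 0
    for h, w in grid_hws.tolist():
        seq = x[start : start + h * w]  # patches of one image, row-major
        new_h, new_w = h // kh, w // kw
        # Split rows and columns into (block index, offset inside block).
        blocks = seq.reshape(new_h, kh, new_w, kw, dim)
        # Bring the two block indices together, then the two in-block offsets.
        blocks = blocks.permute(0, 2, 1, 3, 4).contiguous()
        outputs.append(blocks.reshape(new_h * new_w, kh * kw, dim))
        start += h * w
    return outputs


x = torch.arange(20, dtype=torch.float32).reshape(20, 1)
grid_hws = torch.tensor([[4, 4], [2, 2]])
merged = patch_merger_sketch(x, grid_hws)
```

For the 4x4 image, the first merged group holds patches 0, 1, 4, and 5, which are the top-left 2x2 neighbourhood in row-major order.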

```python
nemo_automodel.components.models.kimivl.model.vision_attention_flash(
    q,
    k,
    v,
    q_cu_seqlens,
    k_cu_seqlens
)
```

Flash attention for vision.

```python
nemo_automodel.components.models.kimivl.model.vision_attention_sdpa(
    q,
    k,
    v,
    q_cu_seqlens,
    k_cu_seqlens
)
```

SDPA attention for vision.
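
With packed variable-length inputs, an SDPA path typically turns the cumulative sequence lengths into a block-diagonal mask so that patches attend only within their own image. The sketch below assumes `(total_tokens, heads, head_dim)` inputs and that layout for `cu_seqlens`; it illustrates the technique and is not the function's actual body.

```python
import torch
import torch.nn.functional as F


def packed_sdpa_sketch(q, k, v, q_cu_seqlens, k_cu_seqlens):
    """q, k, v: (total_tokens, heads, head_dim); cu_seqlens like [0, 3, 7]."""
    total_q, total_k = q.shape[0], k.shape[0]
    mask = torch.zeros(1, total_q, total_k, dtype=torch.bool, device=q.device)
    q_bounds, k_bounds = q_cu_seqlens.tolist(), k_cu_seqlens.tolist()
    for i in range(1, len(q_bounds)):
        # True marks pairs allowed to attend: same image only.
        mask[..., q_bounds[i - 1] : q_bounds[i], k_bounds[i - 1] : k_bounds[i]] = True
    out = F.scaled_dot_product_attention(
        q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1), attn_mask=mask
    )  # (heads, total_q, head_dim)
    return out.transpose(0, 1).reshape(total_q, -1)


torch.manual_seed(0)
q, k, v = (torch.randn(7, 2, 4) for _ in range(3))
cu_seqlens = torch.tensor([0, 3, 7])
out = packed_sdpa_sketch(q, k, v, cu_seqlens, cu_seqlens)  # (7, 8)
```

The dense mask costs memory quadratic in the total token count, which is the usual reason a flash-attention variant with native `cu_seqlens` support is preferred when available.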

```python
nemo_automodel.components.models.kimivl.model.FLASH_ATTN_AVAILABLE = True
```
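
The value shown above reflects the environment in which the documentation was built. Availability flags of this kind are commonly derived from whether the optional package can be found; the sketch below assumes that approach, and `pick_attn_implementation` with its `"sdpa"` fallback name is a hypothetical helper, not part of the module.

```python
import importlib.util

# True only if the optional flash_attn package can be located in this environment.
FLASH_ATTN_AVAILABLE_SKETCH = importlib.util.find_spec("flash_attn") is not None


def pick_attn_implementation(requested: str, flash_available: bool) -> str:
    """Fall back to SDPA when flash attention is requested but not installed."""
    if requested == "flash_attention_2" and not flash_available:
        return "sdpa"
    return requested


chosen = pick_attn_implementation("flash_attention_2", FLASH_ATTN_AVAILABLE_SKETCH)
```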

```python
nemo_automodel.components.models.kimivl.model.LOGGER = logging.getLogger(__name__)
```

```python
nemo_automodel.components.models.kimivl.model.ModelClass = KimiVLForConditionalGeneration
```