nemo_automodel.components.models.kimivl.model
Module Contents
Classes
Functions
Data
API
Callable adapter that wraps DeepseekV3’s freqs_cis-based RoPE.
This is deliberately not an nn.Module, so it is not pruned when the model is split for pipeline parallelism (PP). Instead, it holds a reference to the parent module’s freqs_cis buffer and computes position embeddings on demand.
The parent module (KimiVLLanguageModelBackend) owns the freqs_cis buffer, and this adapter accesses it via the reference.
Access freqs_cis from the parent module.
Compute position embeddings from pre-computed freqs_cis.
Parameters:
Input tensor (used only for device/dtype inference)
Position indices tensor
Returns: torch.Tensor
Position embeddings tensor compatible with DeepseekV3 Block layers
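The adapter pattern described here can be illustrated with a minimal sketch. The names `RotaryAdapterSketch` and `ParentSketch`, the buffer shape, and the lookup by position index are assumptions made for illustration; the real class may compute its output differently.

```python
import torch
from torch import nn


class RotaryAdapterSketch:
    """Hypothetical stand-in: a plain callable, deliberately not an nn.Module."""

    def __init__(self, parent: nn.Module):
        # Store a reference to the parent; the buffer itself is never copied.
        self._parent = parent

    @property
    def freqs_cis(self) -> torch.Tensor:
        # Always read through the parent, so device moves are picked up.
        return self._parent.freqs_cis

    def __call__(self, x: torch.Tensor, position_ids: torch.Tensor) -> torch.Tensor:
        # x is used only to pick the target device (an assumption of this sketch).
        return self.freqs_cis.to(x.device)[position_ids]


class ParentSketch(nn.Module):
    """Hypothetical parent that owns the precomputed complex rotations."""

    def __init__(self, max_len: int = 16, head_dim: int = 8):
        super().__init__()
        inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))
        angles = torch.outer(torch.arange(max_len).float(), inv_freq)
        freqs_cis = torch.polar(torch.ones_like(angles), angles)
        self.register_buffer("freqs_cis", freqs_cis, persistent=False)
        # Not a submodule: it does not show up in named_children().
        self.rotary_emb = RotaryAdapterSketch(self)


parent = ParentSketch()
position_ids = torch.arange(4).unsqueeze(0)  # (batch=1, seq=4)
rope = parent.rotary_emb(torch.zeros(1, 4, 8), position_ids)
```

Because the adapter is an ordinary Python object, module traversal never visits it, while the buffer it reads stays registered on the parent.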
Bases: PretrainedConfig
Configuration for KimiVL model.
Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin
KimiVL model with backend-aware DeepseekV3 language model.
Convenience property to access lm_head from top level.
Bases: Module
Backend-aware language model wrapper using DeepseekV3 architecture.
Note: lm_head is not included here; it lives at the top level of KimiVLForConditionalGeneration to match the HF checkpoint structure.
Bases: Module
KimiVL multimodal backbone with a DeepseekV3 text decoder.
Extract and project image features.
Merge image features into input embeddings.
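A common way to merge image features into text embeddings is to overwrite the embeddings at image-placeholder token positions. The sketch below shows that general technique; the function name, the `image_token_id` parameter, and the tensor shapes are assumptions, not the signature of the method documented here.

```python
import torch


def merge_image_features_sketch(input_ids, inputs_embeds, image_features, image_token_id):
    """Overwrite embeddings at image-placeholder positions with image features."""
    mask = input_ids == image_token_id  # (batch, seq) boolean
    if int(mask.sum()) != image_features.shape[0]:
        raise ValueError("number of placeholder tokens and image features differ")
    merged = inputs_embeds.clone()  # leave the caller's tensor intact
    merged[mask] = image_features.to(merged.dtype)  # fill slots in row-major order
    return merged


input_ids = torch.tensor([[1, 9, 9, 2]])  # 9 is a hypothetical image token id
inputs_embeds = torch.zeros(1, 4, 3)
image_features = torch.tensor([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
merged = merge_image_features_sketch(input_ids, inputs_embeds, image_features, image_token_id=9)
```

The explicit count check makes a mismatch between placeholder tokens and projected features fail loudly instead of silently misaligning images and text.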
Bases: Module
Projects vision features to language model dimension.
State dict adapter for KimiVL checkpoints.
Bases: Module
Learnable 2D interpolatable position embedding.
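As a rough illustration of what "learnable" and "interpolatable" can mean together, the sketch below stores a trainable grid of embeddings and resizes it to the patch grid of each image. The class name, the bicubic mode, and the `(h * w, dim)` input layout are assumptions; the actual class may differ.

```python
import torch
import torch.nn.functional as F
from torch import nn


class Learnable2DPosEmbSketch(nn.Module):
    """Hypothetical sketch: a trainable (height, width, dim) grid, resized on demand."""

    def __init__(self, height: int, width: int, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(height, width, dim))
        nn.init.normal_(self.weight)

    def forward(self, x: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        h, w = grid_hw
        if (h, w) == tuple(self.weight.shape[:2]):
            pos = self.weight
        else:
            # F.interpolate expects (batch, channels, H, W).
            grid = self.weight.permute(2, 0, 1).unsqueeze(0)
            grid = F.interpolate(grid, size=(h, w), mode="bicubic", align_corners=False)
            pos = grid.squeeze(0).permute(1, 2, 0)
        return x + pos.reshape(h * w, -1)


pos_emb = Learnable2DPosEmbSketch(height=4, width=4, dim=8)
resized = pos_emb(torch.zeros(6, 8), (2, 3))  # 2 x 3 patch grid
native = pos_emb(torch.zeros(16, 8), (4, 4))  # matches the stored grid
```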
Bases: PretrainedConfig
Configuration for MoonVit vision encoder.
Bases: Module
Patch embedding for MoonVit.
Bases: Module
MoonVit encoder.
Bases: Module
Single encoder layer for MoonVit.
Bases: Module
MLP for MoonVit.
Bases: Module
MoonVit vision encoder.
Bases: Module
Apply rotary position embedding for vision.
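Rotary embeddings are commonly applied by viewing adjacent channel pairs as complex numbers and multiplying by unit-magnitude phases. The sketch below shows that standard formulation; the function name and the tensor layouts are assumptions and may not match the documented function.

```python
import torch


def apply_rope_sketch(xq, xk, freqs_cis):
    """Rotate query/key channel pairs by per-token complex phases.

    xq, xk: (tokens, heads, head_dim); freqs_cis: (tokens, head_dim // 2), complex.
    """
    freqs = freqs_cis.unsqueeze(-2)  # broadcast over the heads dimension
    xq_c = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_c = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    xq_out = torch.view_as_real(xq_c * freqs).flatten(-2)
    xk_out = torch.view_as_real(xk_c * freqs).flatten(-2)
    return xq_out.type_as(xq), xk_out.type_as(xk)


tokens, heads, head_dim = 5, 2, 8
xq = torch.randn(tokens, heads, head_dim)
xk = torch.randn(tokens, heads, head_dim)
angles = torch.randn(tokens, head_dim // 2)
freqs_cis = torch.polar(torch.ones_like(angles), angles)
q_rot, k_rot = apply_rope_sketch(xq, xk, freqs_cis)
```

Since each phase has magnitude one, the rotation changes direction but not length, so per-head vector norms are preserved.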
Register KimiVLConfig and model with transformers Auto classes.
This uses the official transformers registration API. When registered, AutoModelForImageTextToText.from_pretrained will use our local implementation directly, bypassing the trust_remote_code mechanism entirely.
Merge patches.
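One plausible reading of patch merging is grouping each small neighbourhood of patches into a single unit before projection. The sketch below assumes a packed `(sum of h * w, dim)` input, a per-image `(h, w)` grid list, and a 2 x 2 kernel; all of these are illustrative assumptions rather than the documented signature.

```python
import torch


def merge_patches_sketch(x, grid_hw, kernel=(2, 2)):
    """Group each kh x kw neighbourhood of patches, one output tensor per image."""
    kh, kw = kernel
    dim = x.shape[-1]
    outputs, start = [], 0
    for h, w in grid_hw:
        seq = x[start:start + h * w]  # patches of one image, row-major order
        new_h, new_w = h // kh, w // kw
        blocks = seq.reshape(new_h, kh, new_w, kw, dim)
        # Bring the two kernel axes together: (new_h, new_w, kh, kw, dim).
        blocks = blocks.permute(0, 2, 1, 3, 4).reshape(new_h * new_w, kh * kw, dim)
        outputs.append(blocks)
        start += h * w
    return outputs


patches = torch.arange(16, dtype=torch.float32).reshape(16, 1)  # one 4 x 4 image
merged_blocks = merge_patches_sketch(patches, [(4, 4)])
```

With row-major patch order, the first merged block of a 4 x 4 grid holds patches 0, 1, 4, and 5, that is, the top-left 2 x 2 neighbourhood.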
Flash attention for vision.
SDPA attention for vision.
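When several images are packed into one token sequence, SDPA-based attention typically needs a block-diagonal mask so that patches attend only within their own image. The sketch below illustrates that idea with `torch.nn.functional.scaled_dot_product_attention`; the function name, the `cu_seqlens` argument, and the tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F


def sdpa_vision_attention_sketch(q, k, v, cu_seqlens):
    """q, k, v: (total_tokens, heads, head_dim); cu_seqlens: cumulative image lengths."""
    total = q.shape[0]
    mask = torch.zeros(total, total, dtype=torch.bool, device=q.device)
    bounds = cu_seqlens.tolist()
    for start, end in zip(bounds[:-1], bounds[1:]):
        mask[start:end, start:end] = True  # True means "may attend"
    # SDPA expects (batch, heads, tokens, head_dim).
    q, k, v = (t.transpose(0, 1).unsqueeze(0) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    return out.squeeze(0).transpose(0, 1).reshape(total, -1)


torch.manual_seed(0)
q, k, v = (torch.randn(6, 2, 4) for _ in range(3))
cu_seqlens = torch.tensor([0, 2, 6])  # two images: 2 tokens and 4 tokens
packed = sdpa_vision_attention_sketch(q, k, v, cu_seqlens)
alone = sdpa_vision_attention_sketch(q[:2], k[:2], v[:2], torch.tensor([0, 2]))
```

Running the first image on its own gives the same result as its slice of the packed output, which is the property the mask is meant to guarantee.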