nemo_automodel.components.models.nemotron_v3.cache
Module Contents
Classes
API
Hybrid KV cache for the NemotronH architecture (attention + Mamba2 layers).
Attention layers accumulate key/value tensors along a growing sequence dimension. Mamba2 layers maintain fixed-size conv_state and ssm_state tensors, which do not grow with sequence length. MLP/MoE layers are not cached.
Modeled after FalconHybridMambaAttentionDynamicCache from transformers.
conv_kernel_size
conv_states
key_cache
ssm_states
value_cache
Return attention KV cache sequence length.
Reorder all caches for beam search.
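To make the beam-search reordering concrete, here is a minimal sketch, not the library's implementation. It assumes every cached state has the batch (beam) dimension first, and it uses nested Python lists in place of tensors. The helpers `select_batch` and `reorder_all` are hypothetical names introduced only for this illustration.

```python
import copy
from typing import Dict, List, Optional, Sequence


def select_batch(state: Optional[list], beam_idx: Sequence[int]) -> Optional[list]:
    """Pick batch rows in the order given by beam_idx (an index-select on axis 0)."""
    if state is None:  # layer types that cache nothing keep an empty slot
        return None
    # Deep-copy so two beams that share a parent do not alias the same row.
    return [copy.deepcopy(state[i]) for i in beam_idx]


def reorder_all(caches: Dict[str, List[Optional[list]]], beam_idx: Sequence[int]) -> Dict[str, List[Optional[list]]]:
    """Apply the same batch reordering to every per-layer state of every cache."""
    return {
        name: [select_batch(layer_state, beam_idx) for layer_state in per_layer]
        for name, per_layer in caches.items()
    }


# Two beams, two layers: layer 0 is attention-like, layer 1 is Mamba-like.
caches = {
    "key_cache": [[["k-beam0"], ["k-beam1"]], None],
    "value_cache": [[["v-beam0"], ["v-beam1"]], None],
    "conv_states": [None, [["c-beam0"], ["c-beam1"]]],
    "ssm_states": [None, [["s-beam0"], ["s-beam1"]]],
}

# Both surviving hypotheses continue from old beam 1.
reordered = reorder_all(caches, beam_idx=[1, 1])
```

The point of the sketch is that all four state kinds must be permuted with the same index, otherwise attention and Mamba2 layers would disagree about which hypothesis a batch row belongs to.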
Attention KV cache: append new K/V and return accumulated tensors.
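The append-and-return behaviour, together with the sequence-length query, can be sketched as follows. This is an illustration under stated assumptions rather than the actual class: `ToyAttentionCache` and `concat_seq` are hypothetical, the method names and signatures are assumed, and nested lists of shape [batch][seq_len][head_dim] stand in for real tensors (the head dimension is omitted for brevity).

```python
from typing import List, Optional, Tuple

# Nested lists stand in for tensors of shape [batch, seq_len, head_dim].
Tensor3D = List[List[List[float]]]


def concat_seq(old: Optional[Tensor3D], new: Tensor3D) -> Tensor3D:
    """Concatenate along the sequence axis, one batch row at a time."""
    if old is None:  # first call for this layer: nothing cached yet
        return [list(row) for row in new]
    return [old_row + new_row for old_row, new_row in zip(old, new)]


class ToyAttentionCache:
    """Hypothetical stand-in for the attention half of a hybrid cache."""

    def __init__(self, num_layers: int) -> None:
        self.key_cache: List[Optional[Tensor3D]] = [None] * num_layers
        self.value_cache: List[Optional[Tensor3D]] = [None] * num_layers

    def update(self, key: Tensor3D, value: Tensor3D, layer_idx: int) -> Tuple[Tensor3D, Tensor3D]:
        """Append new K/V for one layer and return everything accumulated so far."""
        self.key_cache[layer_idx] = concat_seq(self.key_cache[layer_idx], key)
        self.value_cache[layer_idx] = concat_seq(self.value_cache[layer_idx], value)
        return self.key_cache[layer_idx], self.value_cache[layer_idx]

    def get_seq_length(self, layer_idx: int = 0) -> int:
        """Number of cached positions for the given attention layer."""
        cached = self.key_cache[layer_idx]
        return 0 if cached is None else len(cached[0])


cache = ToyAttentionCache(num_layers=1)
prefill = [[[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]]  # batch=1, seq_len=3, head_dim=2
cache.update(prefill, prefill, layer_idx=0)
step = [[[3.0, 3.1]]]  # one decoded token
keys, values = cache.update(step, step, layer_idx=0)
```

After a three-token prefill and one decode step, the cached sequence length is four, which is the quantity the sequence-length method reports.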
Update Mamba conv state: full overwrite (prefill) or roll+update (decode).
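The two conv-state modes can be illustrated with a small sketch. It assumes the state for one batch row has shape [channels][conv_kernel_size], that prefill supplies a full window per channel, and that decode supplies exactly one new value per channel. The function name, the `cache_init` flag, and the list-based layout are assumptions made for this example, not the library's confirmed signature.

```python
from typing import List

# One batch row of conv state: [channels][conv_kernel_size].
ConvState = List[List[float]]


def update_conv_state(state: ConvState, new: List[List[float]], cache_init: bool) -> ConvState:
    """Update the sliding window in place and return it.

    cache_init=True  (prefill): `new` is [channels][conv_kernel_size] and replaces the window.
    cache_init=False (decode):  `new` is [channels][1]; the window shifts left by one
                                and the new value fills the last slot.
    """
    for ch in range(len(state)):
        if cache_init:
            # Full overwrite: the caller provides exactly conv_kernel_size values.
            state[ch][:] = new[ch]
        else:
            # Roll by -1 along the window axis, then write the newest value last.
            state[ch][:] = state[ch][1:] + [new[ch][0]]
    return state


conv_kernel_size = 4
state: ConvState = [[0.0] * conv_kernel_size for _ in range(2)]  # 2 channels

# Prefill: overwrite the whole window.
update_conv_state(state, [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]], cache_init=True)

# Decode: drop the oldest entry, append the newest.
update_conv_state(state, [[5.0], [50.0]], cache_init=False)
```

The window size never changes, which is why the Mamba2 state stays fixed in size while the attention cache keeps growing.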