# nemo_automodel.components.models.nemotron_v3.cache

## Module Contents

### Classes

| Name                                                                                             | Description                                                                 |
| ------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------- |
| [`NemotronHybridCache`](#nemo_automodel-components-models-nemotron_v3-cache-NemotronHybridCache) | Hybrid KV cache for the NemotronH architecture (attention + Mamba2 layers). |

### API

```python
class nemo_automodel.components.models.nemotron_v3.cache.NemotronHybridCache(
    config,
    batch_size: int,
    dtype: torch.dtype,
    device: torch.device
)
```

Hybrid KV cache for the NemotronH architecture (attention + Mamba2 layers).

Attention layers accumulate key/value tensors that grow along the sequence dimension.
Mamba2 layers maintain fixed-size `conv_state` and `ssm_state` tensors.
MLP/MoE layers have no cached state.

Modeled after `FalconHybridMambaAttentionDynamicCache` from transformers.
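
To make the per-layer split concrete, here is a minimal standalone sketch of the kind of state each layer type would hold. It does not use `NemotronHybridCache` itself: the layer layout, the dictionary structure, and all sizes and tensor shapes are assumptions chosen for illustration, not values taken from the library or its config.

```python
import torch

# Hypothetical layer layout and sizes, chosen only for illustration.
layer_types = ["mamba", "attention", "mlp", "mamba"]
batch_size, conv_dim, conv_kernel = 2, 16, 4
num_heads, head_dim, state_size = 4, 8, 32

cache = {}
for idx, kind in enumerate(layer_types):
    if kind == "mamba":
        # Fixed-size states: their shapes do not depend on how many tokens were seen.
        cache[idx] = {
            "conv_state": torch.zeros(batch_size, conv_dim, conv_kernel),
            "ssm_state": torch.zeros(batch_size, num_heads, head_dim, state_size),
        }
    elif kind == "attention":
        # K/V start empty and grow along the sequence dimension on each update.
        cache[idx] = {"key": None, "value": None}
    # MLP/MoE layers get no entry: they have nothing to cache.

for idx, entry in cache.items():
    shapes = {name: None if t is None else tuple(t.shape) for name, t in entry.items()}
    print(idx, shapes)
```

Only the attention entries change size during generation, which is why the sequence length of the cache is defined in terms of the attention layers.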

```python
nemo_automodel.components.models.nemotron_v3.cache.NemotronHybridCache.get_seq_length(
    layer_idx: int | None = None
) -> int
```

Return the sequence length of the attention KV cache.
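
One plausible reading, sketched below, is that the length is read from the sequence dimension of a cached key tensor and is 0 before anything has been cached. The helper name `kv_seq_length`, the list-of-tensors storage, and the `(batch, heads, seq_len, head_dim)` layout are assumptions, not the library's implementation.

```python
import torch

def kv_seq_length(key_cache):
    """Hypothetical helper: sequence length of the first populated attention layer.

    key_cache holds one entry per layer: a key tensor with an assumed layout of
    (batch, heads, seq_len, head_dim), or None for layers without KV state.
    """
    for keys in key_cache:
        if keys is not None and keys.numel() > 0:
            return keys.shape[-2]
    return 0  # nothing cached yet

empty_len = kv_seq_length([None, None])
cached_len = kv_seq_length([None, torch.zeros(2, 4, 6, 8), None])
print(empty_len, cached_len)
```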

```python
nemo_automodel.components.models.nemotron_v3.cache.NemotronHybridCache.reorder_cache(
    beam_idx: torch.LongTensor
) -> None
```

Reorder all caches for beam search.
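
In beam search, each surviving beam must continue from the cached state of the beam it was expanded from. A minimal sketch of that reordering for a single tensor follows, assuming the batch (beam) dimension is dimension 0; `reorder_along_batch` is a hypothetical helper, not part of the library.

```python
import torch

def reorder_along_batch(state, beam_idx):
    """Hypothetical helper: pick rows of `state` along dim 0 according to beam_idx."""
    return state.index_select(0, beam_idx.to(state.device))

# Three beams whose cached rows are filled with 0, 1 and 2 respectively.
state = torch.arange(3.0).view(3, 1).repeat(1, 4)
# Beams 0 and 1 now both continue from old beam 2; beam 2 continues from old beam 0.
beam_idx = torch.tensor([2, 2, 0])

reordered = reorder_along_batch(state, beam_idx)
print(reordered[:, 0].tolist())
```

A hybrid cache would apply the same selection to every cached tensor (attention keys and values as well as the Mamba states) so that all layers stay aligned with the chosen beams.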

```python
nemo_automodel.components.models.nemotron_v3.cache.NemotronHybridCache.update(
    key_states: torch.Tensor,
    value_states: torch.Tensor,
    layer_idx: int,
    cache_kwargs: dict[str, typing.Any] | None = None
) -> tuple[torch.Tensor, torch.Tensor]
```

Attention KV cache: append new K/V and return accumulated tensors.
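
The append-and-return behavior can be sketched with plain tensor concatenation. This is an illustration under assumptions, not the library's code: `append_kv` is a hypothetical helper, and the `(batch, num_kv_heads, seq_len, head_dim)` layout with the sequence at dimension 2 is assumed.

```python
import torch

def append_kv(cached_k, cached_v, key_states, value_states, seq_dim=2):
    """Hypothetical helper: append new K/V along the sequence dimension."""
    if cached_k is None:
        # First call (prefill): the new tensors become the cache.
        return key_states, value_states
    keys = torch.cat([cached_k, key_states], dim=seq_dim)
    values = torch.cat([cached_v, value_states], dim=seq_dim)
    return keys, values

# Prefill with 5 tokens, then decode one more token.
k, v = append_kv(None, None, torch.zeros(2, 4, 5, 8), torch.zeros(2, 4, 5, 8))
k, v = append_kv(k, v, torch.ones(2, 4, 1, 8), torch.ones(2, 4, 1, 8))
print(k.shape, v.shape)
```

The attention layer then attends over the returned, accumulated tensors rather than only the newly computed ones.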

```python
nemo_automodel.components.models.nemotron_v3.cache.NemotronHybridCache.update_conv_state(
    layer_idx: int,
    new_conv_state: torch.Tensor,
    cache_position: torch.LongTensor
) -> torch.Tensor
```

Update Mamba conv state: full overwrite (prefill) or roll+update (decode).
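
The two modes can be sketched as follows. This is a minimal illustration under assumptions rather than the library's implementation: the function name, the `(batch, channels, kernel_size)` layout, and the use of the number of cache positions to tell prefill from decode are all assumed.

```python
import torch

def update_conv_state_sketch(conv_state, new_conv_state, cache_position):
    """Hypothetical sketch of the prefill/decode split for a rolling conv window."""
    if cache_position.numel() > 1:
        # Prefill: the whole window is replaced at once.
        conv_state.copy_(new_conv_state)
    else:
        # Decode: shift the window left by one slot, then write the newest column last.
        conv_state.copy_(conv_state.roll(shifts=-1, dims=-1))
        conv_state[:, :, -1] = new_conv_state[:, :, -1]
    return conv_state

state = torch.zeros(1, 2, 4)

# Prefill with a full window: rows [0, 1, 2, 3] and [4, 5, 6, 7].
update_conv_state_sketch(state, torch.arange(8.0).view(1, 2, 4), torch.arange(4))

# Decode one step: the oldest column is dropped and 9 is appended to each row.
update_conv_state_sketch(state, torch.full((1, 2, 1), 9.0), torch.tensor([4]))
print(state[0].tolist())
```

Because the window has a fixed width, the state size stays constant no matter how many tokens have been generated, unlike the attention K/V tensors.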