core.inference.contexts.kv_block_allocator#

Module Contents#

Classes#

KVBlockAllocator

Allocator that manages blocks of memory for the KV cache.

API#

class core.inference.contexts.kv_block_allocator.KVBlockAllocator(
context: DynamicInferenceContext,
total_count: int,
paused_count: int,
enable_prefix_caching: bool = False,
prefix_caching_eviction_policy: megatron.core.inference.config.PrefixCachingEvictionPolicy = PrefixCachingEvictionPolicy.REF_ZERO,
)#

Allocator that manages blocks of memory for the KV cache.

This allocator is responsible for:

  • Initializing a pool of block IDs

  • Allocating blocks from the pool

  • Releasing blocks back to the pool

Parameters:
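  • context (DynamicInferenceContext) – Dynamic inference context.

  • total_count (int) – Total number of blocks in the buffer.

  • paused_count (int) – Number of paused blocks in the buffer. Must be less than total_count.

  • enable_prefix_caching (bool) – Whether prefix caching is enabled. Defaults to False.

  • prefix_caching_eviction_policy (PrefixCachingEvictionPolicy) – Eviction policy used for prefix-cached blocks. Defaults to PrefixCachingEvictionPolicy.REF_ZERO.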
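The three responsibilities listed above can be illustrated with a minimal pure-Python sketch. `ToyBlockPool` and its methods are hypothetical names, not part of Megatron Core; the sketch ignores the active/paused split, prefix caching, and tensors, and is only meant to show the pool lifecycle.

```python
from typing import List, Optional


class ToyBlockPool:
    """Hypothetical stand-in for a block-ID pool (not the real KVBlockAllocator)."""

    def __init__(self, total_count: int) -> None:
        # Initialize the pool with every block ID marked free.
        self.total_count = total_count
        self.free_ids: List[int] = list(range(total_count))

    def allocate(self, num_blocks: int) -> Optional[List[int]]:
        # Return None rather than raising when the pool is too small.
        if num_blocks > len(self.free_ids):
            return None
        split = len(self.free_ids) - num_blocks
        taken = self.free_ids[split:]
        del self.free_ids[split:]
        return taken

    def release(self, block_ids: List[int]) -> None:
        # Hand the IDs back so later allocations can reuse them.
        self.free_ids.extend(block_ids)


pool = ToyBlockPool(total_count=4)
first = pool.allocate(3)      # takes three of the four block IDs
too_many = pool.allocate(2)   # only one block is left, so this is None
pool.release(first)           # all four IDs are free again
```

Returning `None` on failure mirrors the documented behavior of `allocate_memory_blocks`; everything else in the sketch is an assumption made for illustration.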

__str__()#

get_total_used()#

Compute number of total blocks used.

get_active_used()#

Compute number of active blocks used.

get_paused_used()#

Compute number of paused blocks used.

get_active_avail()#

Compute number of active blocks available.

get_paused_avail()#

Compute number of paused blocks available.

is_memory_available(num_blocks: int) → bool#

Check whether num_blocks memory blocks are available.

The check counts both free pool blocks and evictable cached blocks (ref_count == 0).

Parameters:

num_blocks (int) – Number of blocks to check.

Returns:

(bool) True if the requested number of blocks is available, False otherwise.
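As a rough illustration of the availability rule described above, the check reduces to comparing the request against the sum of the two block sources. `can_allocate` and its arguments are hypothetical; the real method reads these counts from allocator state.

```python
def can_allocate(num_free: int, num_evictable: int, num_blocks: int) -> bool:
    """Hypothetical helper mirroring the documented availability rule."""
    # A request fits if free pool blocks plus evictable cached blocks cover it.
    return num_blocks <= num_free + num_evictable


fits = can_allocate(num_free=2, num_evictable=3, num_blocks=5)
does_not_fit = can_allocate(num_free=2, num_evictable=3, num_blocks=6)
```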

allocate_memory_blocks(
num_blocks: int,
) → Optional[torch.Tensor]#

Allocate memory blocks if available, else return None.

If the free pool is insufficient, LRU eviction of cached blocks is attempted before giving up.

Parameters:

num_blocks (int) – Number of blocks to allocate.

Returns:

(Optional[Tensor]) Allocated block IDs, or None if not enough blocks are available.
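The allocate-or-evict flow described above might look like the following sketch. It uses plain Python lists instead of tensors, and `allocate_with_eviction`, `evict`, `free_ids`, and `cached_idle` are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Optional


def allocate_with_eviction(
    free_ids: List[int],
    num_blocks: int,
    evict: Callable[[int], bool],
) -> Optional[List[int]]:
    """Hypothetical helper: take IDs from free_ids, evicting first if short."""
    shortfall = num_blocks - len(free_ids)
    if shortfall > 0 and not evict(shortfall):
        return None  # not enough space even after trying eviction
    return [free_ids.pop() for _ in range(num_blocks)]


free_ids = [7]
cached_idle = [3, 5]  # cached blocks with ref_count == 0, oldest first


def evict(count: int) -> bool:
    # Move the oldest idle cached blocks into the free pool.
    if count > len(cached_idle):
        return False
    for _ in range(count):
        free_ids.append(cached_idle.pop(0))
    return True


got = allocate_with_eviction(free_ids, 2, evict)        # evicts block 3 first
none_left = allocate_with_eviction(free_ids, 5, evict)  # cannot be satisfied
```

Eviction is attempted only for the shortfall, so cached blocks that are not needed stay available for prefix reuse. Whether the real allocator evicts exactly the shortfall is an assumption of this sketch.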

release_memory_blocks(blocks: torch.Tensor) → None#

Release memory blocks by decrementing reference counts.

Blocks whose ref_count reaches 0 remain cached (in the hash map) for potential reuse. They are evicted via LRU when space is needed.

Parameters:

blocks (Tensor) – Block IDs to release.

Returns:

None
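A minimal sketch of reference-counted release, assuming dictionaries in place of the allocator's tensors. `release_blocks`, `ref_counts`, and `block_hashes` are hypothetical names; the point is that a block reaching zero references stays registered rather than returning to the free pool immediately.

```python
from typing import Dict, List


def release_blocks(
    ref_counts: Dict[int, int],
    block_hashes: Dict[int, int],
    block_ids: List[int],
) -> List[int]:
    """Hypothetical helper: decrement ref counts, return IDs that became evictable."""
    newly_idle = []
    for block_id in block_ids:
        ref_counts[block_id] -= 1
        if ref_counts[block_id] == 0 and block_id in block_hashes:
            # The block keeps its hash registration; it is only an eviction candidate.
            newly_idle.append(block_id)
    return newly_idle


ref_counts = {4: 2, 9: 1}
block_hashes = {4: 111, 9: 222}  # block ID -> hash (illustrative values)
idle = release_blocks(ref_counts, block_hashes, [4, 9])
```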

reset() → None#

Reset the allocator to initial state.

This resets the available block count to cover the entire memory pool, excluding the dummy block.

register_kv_block_hashes(
block_ids: list[int],
block_hashes: list[int],
) → None#

Register a batch of blocks in the hash-to-block mapping for discovery.

Parameters:
  • block_ids – List of block IDs.

  • block_hashes – List of computed hash values (same length as block_ids).
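Batch registration can be pictured as filling a dictionary keyed by hash. This is a sketch under the assumption that the mapping behaves like a Python `dict`; `register_hashes` and `hash_to_block` are hypothetical names, and letting a later entry overwrite an earlier one with the same hash is a choice made here, not documented behavior.

```python
from typing import Dict, List


def register_hashes(
    hash_to_block: Dict[int, int],
    block_ids: List[int],
    block_hashes: List[int],
) -> None:
    """Hypothetical helper: record which block holds each hashed prefix."""
    if len(block_ids) != len(block_hashes):
        raise ValueError("block_ids and block_hashes must have the same length")
    for block_id, block_hash in zip(block_ids, block_hashes):
        hash_to_block[block_hash] = block_id


hash_to_block: Dict[int, int] = {}
register_hashes(hash_to_block, block_ids=[10, 11], block_hashes=[0xABC, 0xDEF])
hit = hash_to_block.get(0xABC)   # block holding this prefix
miss = hash_to_block.get(0x123)  # unknown prefix, no cached block
```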

_deregister_blocks(block_ids: torch.Tensor) → None#

Remove blocks from the prefix caching state and return them to the free pool.

Shared cleanup logic for both LRU eviction and REF_ZERO (RZ) proactive eviction.

Parameters:

block_ids – Tensor of block IDs to deregister.

update_timestamps(block_ids: torch.Tensor) → None#

Update LRU timestamps for accessed blocks. No-op in REF_ZERO (RZ) mode.

Parameters:

block_ids – Tensor of block IDs that were accessed.

get_evictable_block_count() → torch.Tensor#

Get the count of cached blocks that can be evicted (ref_count == 0 and a hash is set).

Returns:

Scalar tensor with the number of evictable cached blocks.

evict_lru_blocks(num_blocks_needed: int) → bool#

Evict least recently used (LRU) cached blocks to free up space in the pool.

Evicts blocks with ref_count == 0, starting with oldest timestamps.

Parameters:

num_blocks_needed – Number of blocks to evict.

Returns:

True if enough blocks were evicted, False otherwise.
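Victim selection for LRU eviction can be sketched as a sort over idle blocks by timestamp. The helper below is hypothetical and returns the chosen block IDs rather than a bool so the result is easy to inspect; the dictionaries stand in for the allocator's per-block state.

```python
from typing import Dict, List


def pick_lru_victims(
    ref_counts: Dict[int, int],
    timestamps: Dict[int, int],
    num_blocks_needed: int,
) -> List[int]:
    """Hypothetical helper: choose the oldest idle blocks, or [] if too few exist."""
    idle = [block_id for block_id, count in ref_counts.items() if count == 0]
    if len(idle) < num_blocks_needed:
        return []
    # Oldest timestamp first, so the least recently used blocks go first.
    idle.sort(key=lambda block_id: timestamps[block_id])
    return idle[:num_blocks_needed]


ref_counts = {1: 0, 2: 1, 3: 0, 4: 0}      # block 2 is still referenced
timestamps = {1: 30, 2: 5, 3: 10, 4: 20}   # larger means more recently used
victims = pick_lru_victims(ref_counts, timestamps, num_blocks_needed=2)
too_few = pick_lru_victims(ref_counts, timestamps, num_blocks_needed=4)
```

Block 2 has the oldest timestamp but is skipped because it is still referenced, which matches the documented restriction to blocks with ref_count == 0.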

store_routing_per_block(
flat_routing: Optional[numpy.ndarray],
) → None#

Scatter flat routing indices into per-block storage.

Uses the context’s token-to-block mapping to distribute each token’s routing data into the appropriate block. Matched (prefix-cached) blocks already have routing from the original request and are not overwritten here since their tokens are not in the active token layout.

Parameters:

flat_routing – ndarray of shape [active_token_count, num_layers, topk] aligned with the context’s active-token layout, or None.
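The scatter step can be sketched with NumPy fancy indexing. The arrays `token_block_ids` and `token_positions` are a hypothetical form of the token-to-block mapping, the `per_block` dictionary is a hypothetical storage layout, and filling unwritten slots with -1 is an assumption made for illustration.

```python
import numpy as np

block_size_tokens = 4
num_layers, topk = 2, 1

# Hypothetical token-to-block mapping for 5 active tokens:
# the block each token lives in, and its position inside that block.
token_block_ids = np.array([0, 0, 0, 2, 2])
token_positions = np.array([1, 2, 3, 0, 1])

# Flat routing aligned with the active-token layout: [5, num_layers, topk].
flat_routing = np.arange(5 * num_layers * topk).reshape(5, num_layers, topk)

per_block = {}
for block_id in np.unique(token_block_ids):
    mask = token_block_ids == block_id
    # -1 marks positions that received no routing data in this step.
    storage = np.full((block_size_tokens, num_layers, topk), -1, dtype=flat_routing.dtype)
    storage[token_positions[mask]] = flat_routing[mask]
    per_block[int(block_id)] = storage
```

Blocks that contribute no active tokens never appear in the loop, which is consistent with matched prefix-cached blocks being left untouched.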

reconstruct_routing_from_blocks(
block_ids: list[int],
total_routing_tokens: int,
) → Optional[numpy.ndarray]#

Reconstruct routing indices from per-block storage.

Concatenates per-block routing ndarrays in block order, trimming the last block to exactly total_routing_tokens entries.

Parameters:
  • block_ids – Ordered list of block IDs for the request.

  • total_routing_tokens – Expected number of routing tokens (total_tokens - 1, since the last generated token has no forward-pass routing).

Returns:

ndarray of shape [total_routing_tokens, num_layers, topk], or None if any block is missing routing data.
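A minimal sketch of the concatenate-and-trim step, assuming per-block routing is held in a dictionary of ndarrays. `reconstruct_routing` and `per_block` are hypothetical names, and returning None for an empty block list is an assumption of this sketch.

```python
from typing import Dict, List, Optional

import numpy as np


def reconstruct_routing(
    per_block: Dict[int, np.ndarray],
    block_ids: List[int],
    total_routing_tokens: int,
) -> Optional[np.ndarray]:
    """Hypothetical helper: stitch per-block routing back into one array."""
    chunks = []
    for block_id in block_ids:
        routing = per_block.get(block_id)
        if routing is None:
            return None  # a missing block makes the whole result unavailable
        chunks.append(routing)
    if not chunks:
        return None
    # Concatenate along the token axis, then drop the unused tail of the last block.
    return np.concatenate(chunks, axis=0)[:total_routing_tokens]


per_block = {5: np.zeros((4, 2, 1), dtype=np.int64), 8: np.ones((4, 2, 1), dtype=np.int64)}
# A request with 7 tokens has 6 routing tokens (total_tokens - 1).
routing = reconstruct_routing(per_block, block_ids=[5, 8], total_routing_tokens=6)
missing = reconstruct_routing(per_block, block_ids=[5, 9], total_routing_tokens=6)
```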

store_block_routing(
block_id: int,
positions: numpy.ndarray,
routing: numpy.ndarray,
) → None#

Store routing indices for specific token positions in a block.

Parameters:
  • block_id – The block ID.

  • positions – ndarray of token positions within the block (1D, int).

  • routing – ndarray of routing data with shape [num_positions, num_layers, topk].

get_block_routing(block_id: int) → Optional[numpy.ndarray]#

Get routing indices for a block.

Parameters:

block_id – The block ID.

Returns:

ndarray of shape [block_size_tokens, num_layers, topk], or None if no routing is stored for the block.