core.inference.contexts.kv_block_allocator#
Module Contents#
Classes#
KVBlockAllocator | Allocator that manages blocks of memory for the KV cache.
API#
- class core.inference.contexts.kv_block_allocator.KVBlockAllocator(
- context: DynamicInferenceContext,
- total_count: int,
- paused_count: int,
- enable_prefix_caching: bool = False,
- prefix_caching_eviction_policy: megatron.core.inference.config.PrefixCachingEvictionPolicy = PrefixCachingEvictionPolicy.REF_ZERO,
)
Allocator that manages blocks of memory for the KV cache.
This allocator is responsible for:
- Initializing a pool of block IDs
- Allocating blocks from the pool
- Releasing blocks back to the pool
- Parameters:
context (DynamicInferenceContext) – Dynamic inference context.
total_count (int) – Total number of blocks in the buffer.
paused_count (int) – Number of paused blocks in the buffer. Must be less than total_count.
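The three responsibilities listed above (initialize a pool, allocate, release) can be illustrated with a minimal pure-Python sketch. `ToyBlockPool` and its methods are hypothetical names invented for illustration; the sketch ignores the active/paused split, the dummy block, and prefix caching, and it uses plain lists where the real allocator works with tensors.

```python
class ToyBlockPool:
    """Hypothetical stand-in for a block-ID pool; not the Megatron class."""

    def __init__(self, total_count):
        # Every block ID starts out in the free list.
        self.free_ids = list(range(total_count))

    def allocate(self, num_blocks):
        # Refuse the request outright instead of returning a partial allocation.
        if num_blocks > len(self.free_ids):
            return None
        taken = self.free_ids[:num_blocks]
        self.free_ids = self.free_ids[num_blocks:]
        return taken

    def release(self, block_ids):
        # Released IDs become available for later allocations.
        self.free_ids.extend(block_ids)


pool = ToyBlockPool(total_count=4)
first = pool.allocate(3)    # takes three of the four IDs
second = pool.allocate(2)   # only one ID is left, so this request fails
pool.release(first)
third = pool.allocate(2)    # succeeds once the IDs are back in the pool
```

Returning `None` on failure mirrors the documented behavior of `allocate_memory_blocks`; everything else here is an assumption made to keep the example short.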
Initialization
- __str__()#
- get_total_used()#
Compute number of total blocks used.
- get_active_used()#
Compute number of active blocks used.
- get_paused_used()#
Compute number of paused blocks used.
- get_active_avail()#
Compute number of active blocks available.
- get_paused_avail()#
Compute number of paused blocks available.
- is_memory_available(num_blocks: int) bool#
Check if memory blocks are available.
Includes both free pool blocks and evictable cached blocks (ref_count == 0).
- Parameters:
num_blocks (int) – Number of blocks to check.
- Returns:
(bool) Is memory available?
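As a rough illustration of the check described above, the sketch below counts free blocks plus cached blocks whose reference count is zero. The function name, its arguments, and the dictionary-based bookkeeping are all assumptions for this example; the real allocator tracks this state internally.

```python
def toy_memory_available(num_blocks, free_count, ref_counts, cached_ids):
    """Hypothetical check: free blocks plus evictable cached blocks.

    ref_counts maps block ID -> reference count.
    cached_ids is the set of block IDs currently registered in the hash map.
    """
    # A cached block is evictable only when nothing references it.
    evictable = sum(1 for b in cached_ids if ref_counts[b] == 0)
    return num_blocks <= free_count + evictable


ref_counts = {5: 0, 6: 2, 7: 0}
cached_ids = {5, 6, 7}

fits = toy_memory_available(3, 1, ref_counts, cached_ids)
too_big = toy_memory_available(4, 1, ref_counts, cached_ids)
```

Here one free block plus two idle cached blocks cover a request for three blocks, but not for four, because block 6 is still referenced.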
- allocate_memory_blocks(
- num_blocks: int,
)
Allocate memory blocks if available, else return None.
Will attempt LRU eviction of cached blocks if the free pool is insufficient.
- Parameters:
num_blocks (int) – Number of blocks to allocate.
- Returns:
(Optional[Tensor]) Allocated block IDs.
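The allocate-or-evict control flow described above might look like the following sketch. `toy_allocate`, the `evict` callback, and the `cached_idle` list are hypothetical; in particular, the sketch assumes eviction is attempted only for the shortfall, which the source does not state.

```python
def toy_allocate(num_blocks, free_ids, evict):
    """Hypothetical allocation with an eviction fallback.

    free_ids is a list mutated in place.
    evict(n) tries to move n cached blocks into free_ids and returns a bool.
    """
    shortfall = num_blocks - len(free_ids)
    # Only touch the cache when the free pool alone cannot serve the request.
    if shortfall > 0 and not evict(shortfall):
        return None
    taken = free_ids[:num_blocks]
    del free_ids[:num_blocks]
    return taken


free_ids = [0, 1]
cached_idle = [7, 8]  # assumed: cached blocks with ref_count == 0, oldest first


def evict(n):
    if n > len(cached_idle):
        return False
    for _ in range(n):
        free_ids.append(cached_idle.pop(0))
    return True


got = toy_allocate(3, free_ids, evict)     # needs one eviction
missed = toy_allocate(5, free_ids, evict)  # cannot be satisfied
```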
- release_memory_blocks(blocks: torch.Tensor) None#
Release memory blocks by decrementing reference counts.
Blocks with ref_count == 0 remain cached (in hash map) for potential reuse. They will be evicted via LRU when space is needed.
- Parameters:
blocks (Tensor) – Block IDs to release.
- Returns:
None
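The release semantics above (decrement the reference count, keep the block cached at zero) can be sketched as follows. The function and data structures are hypothetical, and the guard against decrementing below zero is an assumption rather than documented behavior.

```python
def toy_release(block_ids, ref_counts):
    """Hypothetical release: decrement ref counts, return IDs that became idle."""
    for b in block_ids:
        if ref_counts[b] > 0:  # assumed guard against underflow
            ref_counts[b] -= 1
    # Idle blocks are NOT returned to the free pool here; they stay cached.
    return [b for b in block_ids if ref_counts[b] == 0]


ref_counts = {3: 2, 4: 1}
hash_to_block = {0xAB: 3, 0xCD: 4}  # assumed prefix-cache mapping

idle = toy_release([3, 4], ref_counts)
```

After the call, block 4 has no references but is still reachable through `hash_to_block`, so a later request with the same prefix hash could reuse it until it is evicted.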
- reset() None#
Reset the allocator to initial state.
This resets the available block count to the entire memory pool (except for the dummy block).
- register_kv_block_hashes(
- block_ids: list[int],
- block_hashes: list[int],
)
Register blocks in the hash-to-block mapping for discovery (batch).
- Parameters:
block_ids – List of block IDs.
block_hashes – List of computed hash values (same length as block_ids).
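A minimal sketch of batch registration, assuming the mapping is a plain dictionary from hash to block ID with a reverse map for later deregistration. Both dictionaries and the function name are hypothetical; the source only states that blocks are registered in a hash-to-block mapping.

```python
def toy_register_hashes(block_ids, block_hashes, hash_to_block, block_to_hash):
    """Hypothetical batch registration of block hashes for prefix lookup."""
    if len(block_ids) != len(block_hashes):
        raise ValueError("block_ids and block_hashes must have the same length")
    for block_id, block_hash in zip(block_ids, block_hashes):
        hash_to_block[block_hash] = block_id
        block_to_hash[block_id] = block_hash


hash_to_block = {}
block_to_hash = {}
toy_register_hashes([10, 11], [111, 222], hash_to_block, block_to_hash)

hit = hash_to_block.get(222)   # a later request with the same hash finds block 11
miss = hash_to_block.get(333)  # unknown prefix
```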
- _deregister_blocks(block_ids: torch.Tensor) None#
Remove blocks from prefix caching state and return to free pool.
Shared cleanup logic for both LRU eviction and REF_ZERO (RZ) proactive eviction.
- Parameters:
block_ids – Tensor of block IDs to deregister.
- update_timestamps(block_ids: torch.Tensor) None#
Update LRU timestamps for accessed blocks. No-op in REF_ZERO (RZ) mode.
- Parameters:
block_ids – Tensor of block IDs that were accessed.
- get_evictable_block_count() torch.Tensor#
Get count of cached blocks that can be evicted (ref_count == 0, hash set).
- Returns:
Scalar tensor with the number of evictable cached blocks.
- evict_lru_blocks(num_blocks_needed: int) bool#
Evict LRU cached blocks to free up space in the pool.
Evicts blocks with ref_count == 0, starting with oldest timestamps.
- Parameters:
num_blocks_needed – Number of blocks to evict.
- Returns:
True if enough blocks were evicted, False otherwise.
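The selection rule above (idle blocks, oldest timestamp first) can be sketched in a few lines. All names are hypothetical, and the sketch assumes that nothing is evicted when too few idle blocks exist, which the source does not specify.

```python
def toy_evict_lru(num_blocks_needed, ref_counts, timestamps, free_ids):
    """Hypothetical LRU eviction over dictionaries keyed by block ID.

    timestamps holds only cached (hash-registered) blocks.
    """
    idle = [b for b in timestamps if ref_counts.get(b, 0) == 0]
    idle.sort(key=lambda b: timestamps[b])  # oldest access first
    if len(idle) < num_blocks_needed:
        return False  # assumed: all-or-nothing
    for b in idle[:num_blocks_needed]:
        del timestamps[b]    # drop the block from the cache state
        free_ids.append(b)   # and hand it back to the free pool
    return True


ref_counts = {1: 0, 2: 1, 3: 0, 4: 0}
timestamps = {1: 30, 2: 5, 3: 10, 4: 20}
free_ids = []

ok_first = toy_evict_lru(2, ref_counts, timestamps, free_ids)
ok_second = toy_evict_lru(2, ref_counts, timestamps, free_ids)
```

Block 2 has the oldest timestamp but is skipped because it is still referenced, so blocks 3 and 4 go first; the second call fails because only block 1 remains idle.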
- store_routing_per_block(
- flat_routing: Optional[numpy.ndarray],
)
Scatter flat routing indices into per-block storage.
Uses the context’s token-to-block mapping to distribute each token’s routing data into the appropriate block. Matched (prefix-cached) blocks already have routing from the original request and are not overwritten here since their tokens are not in the active token layout.
- Parameters:
flat_routing – ndarray of shape [active_token_count, num_layers, topk] aligned with the context’s active-token layout, or None.
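To make the scatter step concrete, here is a sketch that uses nested Python lists in place of ndarrays. The explicit `token_block_ids` and `token_positions` arguments stand in for the context's token-to-block mapping and are assumptions, as is the dictionary used for per-block storage.

```python
def toy_scatter_routing(flat_routing, token_block_ids, token_positions,
                        block_size, store):
    """Hypothetical scatter of per-token routing into per-block slots."""
    if flat_routing is None:
        return
    for routing, block_id, pos in zip(flat_routing, token_block_ids, token_positions):
        # Create the block's slot list on first use; unfilled slots stay None.
        slots = store.setdefault(block_id, [None] * block_size)
        slots[pos] = routing


store = {}
flat_routing = [[[0, 1]], [[2, 3]], [[4, 5]]]  # 3 tokens, 1 layer, topk = 2
token_block_ids = [7, 7, 9]                    # block holding each active token
token_positions = [0, 1, 0]                    # position of each token in its block

toy_scatter_routing(flat_routing, token_block_ids, token_positions, 2, store)
```

Only blocks that appear in the active-token layout are written, which is consistent with the note that prefix-matched blocks keep the routing stored by the original request.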
- reconstruct_routing_from_blocks(
- block_ids: list[int],
- total_routing_tokens: int,
)
Reconstruct routing indices from per-block storage.
Concatenates per-block routing ndarrays in block order, trimming the last block to exactly total_routing_tokens entries.
- Parameters:
block_ids – Ordered list of block IDs for the request.
total_routing_tokens – Expected number of routing tokens (total_tokens - 1, since the last generated token has no forward-pass routing).
- Returns:
ndarray [total_routing_tokens, num_layers, topk] or None if any block is missing routing data.
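The concatenate-and-trim behavior can be sketched with nested lists standing in for ndarrays. The function name and the dictionary store are hypothetical; the real method operates on per-block NumPy arrays.

```python
def toy_reconstruct_routing(block_ids, total_routing_tokens, store):
    """Hypothetical reconstruction: concatenate blocks in order, then trim."""
    rows = []
    for block_id in block_ids:
        block = store.get(block_id)
        if block is None:
            return None  # any missing block invalidates the whole result
        rows.extend(block)
    # Trimming the concatenation drops the unused tail of the last block.
    return rows[:total_routing_tokens]


store = {
    7: [[[0, 1]], [[2, 3]]],
    9: [[[4, 5]], [[6, 7]]],
}

routing = toy_reconstruct_routing([7, 9], 3, store)
missing = toy_reconstruct_routing([7, 8], 3, store)
```

With two blocks of two tokens each and `total_routing_tokens = 3`, the last entry of block 9 is discarded; asking for block 8, which has no stored routing, yields `None`.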
- store_block_routing(
- block_id: int,
- positions: numpy.ndarray,
- routing: numpy.ndarray,
)
Store routing indices for specific token positions in a block.
- Parameters:
block_id – The block ID.
positions – ndarray of token positions within the block (1D, int).
routing – ndarray of routing data [num_positions, num_layers, topk].
- get_block_routing(block_id: int) Optional[numpy.ndarray]#
Get routing indices for a block.
- Parameters:
block_id – The block ID.
- Returns:
ndarray [block_size_tokens, num_layers, topk] or None if not stored.