core.inference.contexts.dynamic_block_allocator#

Module Contents#

Classes#

BlockAllocator

Allocator that manages blocks of memory for the KV cache.

API#

class core.inference.contexts.dynamic_block_allocator.BlockAllocator(
context: DynamicInferenceContext,
total_count: int,
paused_count: int,
enable_prefix_caching: bool = False,
prefix_caching_eviction_policy: megatron.core.inference.config.PrefixCachingEvictionPolicy = PrefixCachingEvictionPolicy.REF_ZERO,
)#

Allocator that manages blocks of memory for the KV cache.

This allocator is responsible for:

  • Initializing a pool of block IDs

  • Allocating blocks from the pool

  • Releasing blocks back to the pool

Parameters:
  • context (DynamicInferenceContext) – Dynamic inference context.

  • total_count (int) – Total number of blocks in the buffer.

  • paused_count (int) – Number of paused blocks in the buffer. Must be less than total_count.

  • enable_prefix_caching (bool) – Whether to enable prefix caching. Defaults to False.

  • prefix_caching_eviction_policy (PrefixCachingEvictionPolicy) – Eviction policy used when prefix caching is enabled. Defaults to PrefixCachingEvictionPolicy.REF_ZERO.

Initialization
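The three responsibilities above can be illustrated with a minimal, self-contained sketch. This is plain Python, independent of Megatron-Core; `SimpleBlockPool` and its fields are hypothetical stand-ins for illustration, not the real implementation:

```python
from typing import Optional


class SimpleBlockPool:
    """Toy stand-in illustrating BlockAllocator's free-pool bookkeeping."""

    def __init__(self, total_count: int, paused_count: int):
        assert paused_count < total_count
        self.total_count = total_count
        self.paused_count = paused_count
        # Initialize a pool of block IDs (responsibility 1).
        self.free_ids = list(range(total_count))

    def allocate(self, num_blocks: int) -> Optional[list]:
        # Allocate blocks from the pool, or None if insufficient (responsibility 2).
        if num_blocks > len(self.free_ids):
            return None
        allocated = self.free_ids[:num_blocks]
        del self.free_ids[:num_blocks]
        return allocated

    def release(self, blocks: list) -> None:
        # Release blocks back to the pool (responsibility 3).
        self.free_ids.extend(blocks)


pool = SimpleBlockPool(total_count=8, paused_count=2)
blocks = pool.allocate(3)   # → [0, 1, 2]
pool.release(blocks)
print(len(pool.free_ids))   # → 8
```

The real allocator returns block IDs as a `torch.Tensor` and layers reference counting and prefix caching on top of this basic pool.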

__str__()#
get_total_used()#

Compute the total number of blocks used.

get_active_used()#

Compute number of active blocks used.

get_paused_used()#

Compute number of paused blocks used.

get_active_avail()#

Compute number of active blocks available.

get_paused_avail()#

Compute number of paused blocks available.

is_memory_available(num_blocks: int) bool#

Check if memory blocks are available.

Includes both free pool blocks and evictable cached blocks (ref_count == 0).

Parameters:

num_blocks (int) – Number of blocks to check.

Returns:

(bool) Is memory available?

allocate_memory_blocks(
num_blocks: int,
) Optional[torch.Tensor]#

Allocate memory blocks if available, else return None.

Will attempt LRU eviction of cached blocks if the free pool is insufficient.

Parameters:

num_blocks (int) – Number of blocks to allocate.

Returns:

(Optional[Tensor]) Allocated block IDs.
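The allocate-with-eviction behavior can be sketched as follows. This is an illustrative simplification in plain Python; `free_pool` and `cached_blocks` are hypothetical names, not the allocator's real fields, and the real method operates on tensors:

```python
import heapq
from typing import Optional

# Hypothetical state: a small free pool plus cached blocks tracked as
# block_id -> (ref_count, last_access_timestamp).
free_pool = [0, 1]
cached_blocks = {2: (0, 10.0), 3: (0, 5.0), 4: (1, 7.0)}


def allocate_memory_blocks(num_blocks: int) -> Optional[list]:
    """Allocate from the free pool, evicting LRU cached blocks if needed."""
    # Only cached blocks with ref_count == 0 are evictable.
    evictable = [(ts, bid) for bid, (ref, ts) in cached_blocks.items() if ref == 0]
    if num_blocks > len(free_pool) + len(evictable):
        return None  # not enough memory even after eviction
    # Evict oldest-timestamp (LRU) blocks until the pool is large enough.
    heapq.heapify(evictable)
    while len(free_pool) < num_blocks:
        _, bid = heapq.heappop(evictable)
        del cached_blocks[bid]   # drop from prefix-cache state
        free_pool.append(bid)    # return to the free pool
    allocated = free_pool[:num_blocks]
    del free_pool[:num_blocks]
    return allocated


allocated = allocate_memory_blocks(3)
print(allocated)  # → [0, 1, 3]  (block 3 evicted: oldest with ref_count == 0)
```

Note that block 4 is never a candidate for eviction while its ref_count is nonzero.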

release_memory_blocks(blocks: torch.Tensor) None#

Release memory blocks by decrementing reference counts.

Blocks with ref_count == 0 remain cached (in hash map) for potential reuse. They will be evicted via LRU when space is needed.

Parameters:

blocks (Tensor) – Block IDs to release.

Returns:

None
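The release semantics can be sketched as below: releasing decrements reference counts but does not immediately return blocks to the free pool. The names `ref_counts` and `free_pool` are hypothetical stand-ins for illustration:

```python
# Hypothetical state: block_id -> ref_count, plus an (untouched) free pool.
ref_counts = {5: 2, 6: 1}
free_pool = []


def release_memory_blocks(blocks: list) -> None:
    """Decrement reference counts; ref_count == 0 blocks remain cached."""
    for bid in blocks:
        ref_counts[bid] -= 1
        # Deliberately NOT moved to free_pool here: blocks that reach
        # ref_count == 0 stay cached (in the hash map) for potential
        # reuse, and are only reclaimed later by LRU eviction.


release_memory_blocks([5, 6])
print(ref_counts)  # → {5: 1, 6: 0}
```

Block 6 now sits at ref_count == 0: still cached and reusable, but eligible for eviction if the pool runs short.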

reset() None#

Reset the allocator to initial state.

This resets the available block count to the entire memory pool (except for the dummy block).

register_block_hashes(
block_ids: list[int],
block_hashes: list[int],
) None#

Register a batch of blocks in the hash-to-block mapping so they can be discovered and reused by later requests.

Parameters:
  • block_ids – List of block IDs.

  • block_hashes – List of computed hash values (same length as block_ids).
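The hash-to-block mapping can be sketched as a plain dictionary. `hash_to_block` is an illustrative name, not the allocator's real field, and the hash values below are arbitrary examples:

```python
# Hypothetical hash -> block_id mapping used for prefix-cache discovery.
hash_to_block: dict[int, int] = {}


def register_block_hashes(block_ids: list, block_hashes: list) -> None:
    """Register blocks so requests with a matching prefix hash can reuse them."""
    assert len(block_ids) == len(block_hashes)
    for bid, h in zip(block_ids, block_hashes):
        hash_to_block[h] = bid


register_block_hashes([10, 11], [0xABC, 0xDEF])
# A later request computing hash 0xABC discovers and reuses block 10:
print(hash_to_block.get(0xABC))  # → 10
```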

_deregister_blocks(block_ids: torch.Tensor) None#

Remove blocks from prefix caching state and return to free pool.

Shared cleanup logic for both LRU eviction and REF_ZERO (RZ) proactive eviction.

Parameters:

block_ids – Tensor of block IDs to deregister.

update_timestamps(block_ids: torch.Tensor) None#

Update LRU timestamps for accessed blocks. No-op in RZ mode.

Parameters:

block_ids – Tensor of block IDs that were accessed.

get_evictable_block_count() torch.Tensor#

Get count of cached blocks that can be evicted (ref_count == 0, hash set).

Returns:

Scalar tensor with the number of evictable cached blocks.

evict_lru_blocks(num_blocks_needed: int) bool#

Evict LRU cached blocks to free up space in the pool.

Evicts blocks with ref_count == 0, starting with oldest timestamps.

Parameters:

num_blocks_needed – Number of free blocks needed; eviction proceeds until at least this many blocks have been freed.

Returns:

True if enough blocks were evicted, False otherwise.