core.inference.contexts.dynamic_block_allocator#

Module Contents#

Classes#

BlockAllocator

Allocator that manages blocks of memory for the KV cache.

API#

class core.inference.contexts.dynamic_block_allocator.BlockAllocator(context: DynamicInferenceContext, total_count: int)#

Allocator that manages blocks of memory for the KV cache.

This allocator is responsible for:

  • Initializing a pool of block IDs

  • Allocating blocks from the pool

  • Releasing blocks back to the pool

Parameters:
  • context (DynamicInferenceContext) – Dynamic inference context.

  • total_count (int) – Total number of active blocks available in the buffer. The full buffer size is 2 * total_count, to accommodate an equal-sized space for paused requests that live on the CPU.

Initialization
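The pool-based scheme described above can be sketched in plain Python. This is a simplified, hypothetical stand-in for illustration only; the real allocator operates on a `torch.Tensor` of block IDs and additionally tracks a paused-request region:

```python
from typing import List, Optional

class SimpleBlockAllocator:
    """Simplified sketch of a pool-based block allocator (hypothetical)."""

    def __init__(self, total_count: int):
        # Initialize the pool with every block ID.
        self.avail: List[int] = list(range(total_count))

    def is_memory_available(self, num_blocks: int) -> bool:
        # Check whether enough free blocks remain in the pool.
        return len(self.avail) >= num_blocks

    def allocate_memory_blocks(self, num_blocks: int) -> Optional[List[int]]:
        # Allocate from the pool if available, else return None.
        if not self.is_memory_available(num_blocks):
            return None
        blocks, self.avail = self.avail[:num_blocks], self.avail[num_blocks:]
        return blocks

    def release_memory_blocks(self, blocks: List[int]) -> None:
        # Return released block IDs to the pool.
        self.avail.extend(blocks)

alloc = SimpleBlockAllocator(total_count=4)
blocks = alloc.allocate_memory_blocks(3)        # [0, 1, 2]
assert alloc.allocate_memory_blocks(2) is None  # only 1 block left
alloc.release_memory_blocks(blocks)             # pool restored
```

Releasing never frees GPU memory in this scheme; it only returns IDs to the free pool, which keeps allocation and release O(1) amortized.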

__str__()#

get_active_used()#

Compute number of active blocks used.

get_paused_used()#

Compute number of paused blocks used.

get_active_avail()#

Compute number of active blocks available.

get_paused_avail()#

Compute number of paused blocks available.

is_memory_available(num_blocks: int) → bool#

Check if memory blocks are available.

Parameters:

num_blocks (int) – Number of blocks to check.

Returns:

(bool) True if at least num_blocks blocks are available.

allocate_memory_blocks(num_blocks: int) → Optional[torch.Tensor]#

Allocate memory blocks if available, else return None.

Parameters:

num_blocks (int) – Number of blocks to allocate.

Returns:

(Optional[Tensor]) Allocated block IDs, or None if fewer than num_blocks blocks are available.
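Callers are expected to handle the None return before using the result. A minimal sketch of that pattern (hypothetical caller code; `pool` is a plain free list standing in for the allocator's internal state, not the real BlockAllocator):

```python
def try_allocate(pool: list, num_blocks: int):
    """Pop num_blocks block IDs from the free pool, or return None if short."""
    if len(pool) < num_blocks:
        return None  # caller must pause/evict requests before retrying
    taken, pool[:] = pool[:num_blocks], pool[num_blocks:]
    return taken

pool = [0, 1, 2]
assert try_allocate(pool, 2) == [0, 1]
assert try_allocate(pool, 5) is None  # pool too small; nothing consumed
```

A failed allocation leaves the pool untouched, so the caller can free capacity and retry without any cleanup.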

release_memory_blocks(blocks: torch.Tensor) → None#

Release memory blocks.

Parameters:

blocks (Tensor) – Block IDs to release.

Returns:

None

reset() → None#

Reset the allocator to initial state.

This resets the available block count to the entire memory pool (except for the dummy block).