core.inference.contexts.dynamic_block_allocator#
Module Contents#
Classes#
- BlockAllocator – Allocator that manages blocks of memory for the KV cache.
API#
- class core.inference.contexts.dynamic_block_allocator.BlockAllocator(context: DynamicInferenceContext, active_count: int)#
Allocator that manages blocks of memory for the KV cache.
This allocator is responsible for:
Initializing a pool of block IDs
Allocating blocks from the pool
Releasing blocks back to the pool
- Parameters:
context (DynamicInferenceContext) – Dynamic inference context.
active_count (int) – Total number of active blocks available in the buffer. The full buffer size is 2*active_count, to accommodate an equal-sized space for paused requests that live on the CPU.
Initialization
- __str__()#
- get_active_used()#
Compute number of active blocks used.
- get_paused_used()#
Compute number of paused blocks used.
- get_active_avail()#
Compute number of active blocks available.
- get_paused_avail()#
Compute number of paused blocks available.
- is_memory_available(num_blocks: int) → bool#
Check if memory blocks are available.
- Parameters:
num_blocks (int) – Number of blocks to check.
- Returns:
(bool) Is memory available?
- allocate_memory_blocks(num_blocks: int) → Optional[torch.Tensor]#
Allocate memory blocks if available, else return None.
- Parameters:
num_blocks (int) – Number of blocks to allocate.
- Returns:
(Optional[Tensor]) Allocated block IDs.
- release_memory_blocks(blocks: torch.Tensor) → None#
Release memory blocks.
- Parameters:
blocks (Tensor) – Block IDs to release.
- Returns:
None
- reset() → None#
Reset the allocator to initial state.
This resets the available block count to the entire memory pool (except for the dummy block).