core.inference.contexts.gpu_view

Module Contents

Classes

ContextGPUView

GPU-resident snapshot of context bookkeeping data for the forward pass.

API

class core.inference.contexts.gpu_view.ContextGPUView(
    max_requests: int,
    max_tokens: int,
    max_kv_blocks: int,
    device: torch.device,
    max_mamba_chunks: int = 0,
)

GPU-resident snapshot of context bookkeeping data for the forward pass.

This is the ONLY interface GPU code (attention kernels, KV append, RoPE, sampling, log-probs, speculative verification) uses to read context state. CPU bookkeeping code accesses context tensors directly.

Populated once per step by DynamicInferenceContext.transfer_bookkeeping_to_gpu(). All tensors have fixed addresses for CUDA graph compatibility.
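The fixed-address requirement means the per-step refresh must write into preallocated tensors with copy_() rather than rebinding attributes to new tensors. A minimal sketch of that invariant (the tensor name is illustrative, and CPU tensors stand in for GPU ones so it runs anywhere):

```python
import torch

# CUDA-graph capture bakes in device pointers, so snapshot tensors must keep
# a stable address across steps: refresh them in place with copy_(), never
# rebind them to freshly allocated tensors.
snapshot = torch.zeros(4, dtype=torch.int32)   # allocated once at init
addr_before = snapshot.data_ptr()

new_state = torch.tensor([1, 2, 3, 4], dtype=torch.int32)
snapshot.copy_(new_state)                      # in-place: address unchanged

assert snapshot.data_ptr() == addr_before
print(snapshot.tolist())  # → [1, 2, 3, 4]
```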

Convention:
  context.foo -> CPU (source of truth, used by bookkeeping)
  context.gpu_view.foo -> GPU (snapshot, used by forward pass)
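The convention can be sketched as follows. This is a hypothetical miniature, not the real class: the field name request_lengths is illustrative, and the "GPU" side is allocated on CPU so the sketch runs without a device.

```python
import types
import torch

class _Ctx:
    """Hypothetical sketch of the context.foo / context.gpu_view.foo split."""
    def __init__(self):
        # context.foo: CPU source of truth, mutated by bookkeeping code.
        self.request_lengths = torch.zeros(8, dtype=torch.int32)
        # context.gpu_view.foo: snapshot read by the forward pass.
        # (Allocated on CPU here so the sketch runs without a GPU.)
        self.gpu_view = types.SimpleNamespace(
            request_lengths=torch.zeros(8, dtype=torch.int32)
        )

    def transfer_bookkeeping_to_gpu(self):
        # Refresh the snapshot in place, once per step.
        self.gpu_view.request_lengths.copy_(self.request_lengths)

ctx = _Ctx()
ctx.request_lengths[0] = 17        # bookkeeping writes the CPU side
ctx.transfer_bookkeeping_to_gpu()  # per-step snapshot transfer
print(int(ctx.gpu_view.request_lengths[0]))  # → 17 (forward pass reads this)
```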

Layout note: the bookkeeping fields are backed by a single contiguous uint8 buffer (self._buf). Each field is a view(dtype) onto a slice of that buffer. This matches the pinned-CPU-buffer layout in DynamicInferenceContext, so that the per-step H2D transfer is a single cudaMemcpyAsync instead of one per field.
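A sketch of that layout under assumed field names (the real schema lives in DynamicInferenceContext; query_lengths and kv_seq_lengths here are stand-ins, and a second CPU buffer stands in for the GPU side):

```python
import torch

# Hypothetical schema: field name -> number of int32 elements.
FIELDS = {"query_lengths": 4, "kv_seq_lengths": 4}
ITEM = 4  # bytes per int32 element

def alloc_views(device="cpu"):
    # One contiguous uint8 backing buffer...
    buf = torch.empty(sum(FIELDS.values()) * ITEM, dtype=torch.uint8, device=device)
    views, off = {}, 0
    for name, n in FIELDS.items():
        # ...with each field exposed as a typed view onto a slice (no copy).
        views[name] = buf[off:off + n * ITEM].view(torch.int32)
        off += n * ITEM
    return buf, views

cpu_buf, cpu = alloc_views()   # stands in for the pinned CPU-side buffer
gpu_buf, gpu = alloc_views()   # 'cpu' stands in for the GPU-side buffer
cpu["query_lengths"].fill_(3)
cpu["kv_seq_lengths"].fill_(9)
gpu_buf.copy_(cpu_buf)         # a single copy moves every field at once
print(gpu["query_lengths"].tolist(), gpu["kv_seq_lengths"].tolist())
```

Because every field aliases the same backing storage, one buffer-level copy (a single cudaMemcpyAsync in the real pinned-to-GPU case) refreshes all fields, and the typed views on the destination see the new values immediately.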

Initialization