core.inference.contexts.gpu_view

Module Contents

Classes

ContextGPUView

GPU-resident snapshot of context bookkeeping data for the forward pass.

API

class core.inference.contexts.gpu_view.ContextGPUView(
    max_requests: int,
    max_tokens: int,
    max_kv_blocks: int,
    device: torch.device,
    max_mamba_chunks: int = 0,
)

GPU-resident snapshot of context bookkeeping data for the forward pass.

This is the ONLY interface GPU code (attention kernels, KV append, RoPE, sampling, log-probs, speculative verification) uses to read context state. CPU bookkeeping code accesses context tensors directly.

Populated once per step by DynamicInferenceContext.transfer_bookkeeping_to_gpu(). All tensors have fixed addresses for CUDA graph compatibility.
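The fixed-address requirement means the per-step refresh must write into preallocated tensors with copy_() rather than rebinding attributes to new tensors. A minimal sketch of that invariant (the tensor name is illustrative, and CPU tensors stand in for GPU ones so it runs anywhere):

```python
import torch

# CUDA-graph capture bakes in device pointers, so snapshot tensors must keep
# a stable address across steps: refresh them in place with copy_(), never
# rebind them to freshly allocated tensors.
snapshot = torch.zeros(4, dtype=torch.int32)   # allocated once at init
addr_before = snapshot.data_ptr()

new_state = torch.tensor([1, 2, 3, 4], dtype=torch.int32)
snapshot.copy_(new_state)                      # in-place: address unchanged

assert snapshot.data_ptr() == addr_before
print(snapshot.tolist())  # → [1, 2, 3, 4]
```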

Convention:
  context.foo -> CPU (source of truth, used by bookkeeping)
  context.gpu_view.foo -> GPU (snapshot, used by forward pass)
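The convention can be sketched as follows. This is a hypothetical miniature, not the real class: the field name request_lengths is illustrative, and the "GPU" side is allocated on CPU so the sketch runs without a device.

```python
import types
import torch

class _Ctx:
    """Hypothetical sketch of the context.foo / context.gpu_view.foo split."""
    def __init__(self):
        # context.foo: CPU source of truth, mutated by bookkeeping code.
        self.request_lengths = torch.zeros(8, dtype=torch.int32)
        # context.gpu_view.foo: snapshot read by the forward pass.
        # (Allocated on CPU here so the sketch runs without a GPU.)
        self.gpu_view = types.SimpleNamespace(
            request_lengths=torch.zeros(8, dtype=torch.int32)
        )

    def transfer_bookkeeping_to_gpu(self):
        # Refresh the snapshot in place, once per step.
        self.gpu_view.request_lengths.copy_(self.request_lengths)

ctx = _Ctx()
ctx.request_lengths[0] = 17        # bookkeeping writes the CPU side
ctx.transfer_bookkeeping_to_gpu()  # per-step snapshot transfer
print(int(ctx.gpu_view.request_lengths[0]))  # → 17 (forward pass reads this)
```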

Layout note: the bookkeeping fields are backed by a single contiguous uint8 buffer (self._buf). Each field is a view(dtype) onto a slice of that buffer. This matches the pinned-CPU-buffer layout in DynamicInferenceContext, so that the per-step H2D transfer is a single cudaMemcpyAsync instead of one per field.
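A sketch of that layout under assumed field names (the real schema lives in DynamicInferenceContext; query_lengths and kv_seq_lengths here are stand-ins, and a second CPU buffer stands in for the GPU side):

```python
import torch

# Hypothetical schema: field name -> number of int32 elements.
FIELDS = {"query_lengths": 4, "kv_seq_lengths": 4}
ITEM = 4  # bytes per int32 element

def alloc_views(device="cpu"):
    # One contiguous uint8 backing buffer...
    buf = torch.empty(sum(FIELDS.values()) * ITEM, dtype=torch.uint8, device=device)
    views, off = {}, 0
    for name, n in FIELDS.items():
        # ...with each field exposed as a typed view onto a slice (no copy).
        views[name] = buf[off:off + n * ITEM].view(torch.int32)
        off += n * ITEM
    return buf, views

cpu_buf, cpu = alloc_views()   # stands in for the pinned CPU-side buffer
gpu_buf, gpu = alloc_views()   # 'cpu' stands in for the GPU-side buffer
cpu["query_lengths"].fill_(3)
cpu["kv_seq_lengths"].fill_(9)
gpu_buf.copy_(cpu_buf)         # a single copy moves every field at once
print(gpu["query_lengths"].tolist(), gpu["kv_seq_lengths"].tolist())
```

Because every field aliases the same backing storage, one buffer-level copy (a single cudaMemcpyAsync in the real pinned-to-GPU case) refreshes all fields, and the typed views on the destination see the new values immediately.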

Initialization