# core.inference.contexts.gpu_view

## Module Contents

### Classes

| Class | Description |
|---|---|
| `ContextGPUView` | GPU-resident snapshot of context bookkeeping data for the forward pass. |

## API
```python
class core.inference.contexts.gpu_view.ContextGPUView(
    max_requests: int,
    max_tokens: int,
    max_kv_blocks: int,
    device: torch.device,
    max_mamba_chunks: int = 0,
)
```
GPU-resident snapshot of context bookkeeping data for the forward pass.
This is the ONLY interface GPU code (attention kernels, KV append, RoPE, sampling, log-probs, speculative verification) uses to read context state. CPU bookkeeping code accesses context tensors directly.
Populated once per step by `DynamicInferenceContext.transfer_bookkeeping_to_gpu()`. All tensors have fixed addresses for CUDA graph compatibility.

Convention:

- `context.foo` -> CPU (source of truth, used by bookkeeping)
- `context.gpu_view.foo` -> GPU (snapshot, used by forward pass)

Layout note: the bookkeeping fields are backed by a single contiguous `uint8` buffer (`self._buf`). Each field is a `view(dtype)` onto a slice of that buffer. This matches the pinned-CPU-buffer layout in `DynamicInferenceContext` so that the per-step H2D transfer is a single `cudaMemcpyAsync` instead of one per field.
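The single-buffer layout above can be sketched in a few lines. This is an illustrative stand-in, not the actual implementation: numpy substitutes for torch, and the field names and sizes (`seq_lens`, `block_ids`, 16 bytes) are assumptions chosen for the example. The point is the technique: every field is a typed view onto one contiguous `uint8` buffer, so a single bulk copy (the analogue of the one `cudaMemcpyAsync` per step) refreshes all fields at once.

```python
import numpy as np

BUF_BYTES = 16
# The contiguous uint8 backing buffer (the role of self._buf).
buf = np.zeros(BUF_BYTES, dtype=np.uint8)

# Each "field" is a typed view onto a byte slice of the same buffer;
# names and dtypes here are illustrative assumptions.
seq_lens = buf[0:8].view(np.int32)    # 2 x int32
block_ids = buf[8:16].view(np.int64)  # 1 x int64

# Simulated per-step transfer: one bulk copy into the backing buffer
# updates every field view simultaneously.
src = np.arange(BUF_BYTES, dtype=np.uint8)
buf[:] = src

print(seq_lens.tolist())
print(block_ids.tolist())
```

Because the views alias the buffer's memory rather than copying it, the per-field tensors keep fixed addresses across steps, which is also what makes the layout CUDA-graph friendly in the real class.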
### Initialization