core.inference.sampling.base#
Module Contents#
Classes#
Sampling – Abstract base for inference sampling backends.
API#
- class core.inference.sampling.base.Sampling#
Bases: abc.ABC

Abstract base for inference sampling backends.

Subclasses implement sample_kernel. CUDA graphs are added via CudaGraphManager.

- abstractmethod sample_kernel(
- logits: torch.Tensor,
- n: int,
- context,
- *,
- gather_indices: Optional[torch.Tensor] = None,
- token_to_request_index: Optional[torch.Tensor] = None,
- eager: bool = False,
- cache_key: Any = None,
)#
Sample n tokens from logits and return them.
- Parameters:
  - logits – Logits tensor of shape [>=n, vocab_size].
  - n – Number of rows to sample.
  - context – The active DynamicInferenceContext.
  - gather_indices – If provided, only sample from logits[gather_indices[:n], :].
  - token_to_request_index – Per-token request mapping; when set, sampling parameters are gathered per-token instead of per-request.
  - eager – Consumed by CudaGraphManager when it wraps this kernel.
  - cache_key – Consumed by CudaGraphManager when it wraps this kernel.
- Returns:
Sampled token ids of shape [n]. Under CUDA graph replay, this is a static buffer.
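A minimal sketch of what a concrete backend might look like. The class and method names mirror the documented signature; GreedySampling is a hypothetical subclass, and plain nested Python lists stand in for torch.Tensor so the sketch stays self-contained. A real backend would operate on CUDA tensors and defer graph capture to CudaGraphManager.

```python
from abc import ABC, abstractmethod


class Sampling(ABC):
    """Stand-in for core.inference.sampling.base.Sampling (sketch only)."""

    @abstractmethod
    def sample_kernel(self, logits, n, context, *, gather_indices=None,
                      token_to_request_index=None, eager=False, cache_key=None):
        ...


class GreedySampling(Sampling):
    """Hypothetical backend that picks the argmax of each logits row."""

    def sample_kernel(self, logits, n, context, *, gather_indices=None,
                      token_to_request_index=None, eager=False, cache_key=None):
        # Honor gather_indices: sample only from logits[gather_indices[:n], :].
        rows = ([logits[i] for i in gather_indices[:n]]
                if gather_indices is not None else logits[:n])
        # Greedy pick: index of the max logit per row, giving shape [n].
        return [max(range(len(row)), key=row.__getitem__) for row in rows]


sampler = GreedySampling()
tokens = sampler.sample_kernel([[0.1, 2.0, 0.3], [5.0, 0.0, 0.0]], n=2, context=None)
print(tokens)  # [1, 0]
```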
- sample_speculative(
- required_logits: torch.Tensor,
- num_decode: int,
- num_prefill: int,
- num_speculative_tokens: int,
- context,
- *,
- gather_indices: Optional[torch.Tensor] = None,
- eager: bool = False,
- cache_key: Any = None,
)#
Sample tokens for the speculative-verify path.
Decode requests contribute 1 + num_speculative_tokens rows; prefill requests contribute 1. Builds the per-token request mapping and dispatches to sample_kernel. The sample_kernel is forced eager so its own CudaGraphManager wrapper does not fire.

When gather_indices is supplied, the kernel selects via logits[gather_indices[:n], :]. When gather_indices is None, required_logits is expected to be already pre-gathered to the layout described above (e.g. when materialize_only_last_token_logits=True upstream).
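The row layout above can be sketched as follows. This is an illustrative reconstruction of the per-token request mapping, not the library's actual helper; it assumes decode rows precede prefill rows in required_logits, which is an implementation detail of the upstream context.

```python
def build_token_to_request_index(num_decode, num_prefill, num_speculative_tokens):
    """Sketch of the per-token mapping sample_speculative builds (assumed layout).

    Each decode request contributes 1 + num_speculative_tokens rows
    (its bonus token plus the draft tokens to verify); each prefill
    request contributes a single row.
    """
    rows_per_decode = 1 + num_speculative_tokens
    mapping = []
    for req in range(num_decode):
        mapping.extend([req] * rows_per_decode)
    for req in range(num_decode, num_decode + num_prefill):
        mapping.append(req)  # one row per prefill request
    return mapping


# 2 decode requests with 2 speculative tokens each, plus 1 prefill request:
print(build_token_to_request_index(num_decode=2, num_prefill=1,
                                   num_speculative_tokens=2))
# [0, 0, 0, 1, 1, 1, 2]
```

With this mapping set, sample_kernel gathers sampling parameters (temperature, top-k, and so on) per-token rather than per-request, so all rows belonging to one decode request share that request's parameters.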