core.inference.batch_dimensions_utils#
Batch dimensions utilities.
This module contains utilities for managing batch dimensions, including the InferenceBatchDimensions dataclass and CUDAGraphBatchDimensionBuilder for generating and matching CUDA graph batch dimensions.
Module Contents#
Classes#
- InferenceBatchDimensions – Batch dimensions for dynamic inference.
- CUDAGraphBatchDimensionBuilder – Builder for creating and managing CUDA graph batch dimensions.
API#
- class core.inference.batch_dimensions_utils.InferenceBatchDimensions#
Batch dimensions for dynamic inference.
- Attributes:
  token_count – number of total input tokens
  prefill_req_count – number of prefill requests
  decode_req_count – number of decode requests
The batch dimensions are ordered by token_count, then by prefill_req_count, then by decode_req_count.
- token_count: int = 0#
- prefill_req_count: int = 0#
- decode_req_count: int = 0#
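The ordering described above matches standard dataclass field ordering. The following is a minimal sketch, assuming comparison is driven by declared field order; the real class also defines __str__, __hash__, and __eq__ explicitly, as documented below:

```python
from dataclasses import dataclass

@dataclass(order=True)
class InferenceBatchDimensions:
    # Field order drives comparisons: token_count is compared first,
    # then prefill_req_count, then decode_req_count.
    token_count: int = 0
    prefill_req_count: int = 0
    decode_req_count: int = 0

# A larger token count sorts later regardless of the request counts.
assert InferenceBatchDimensions(8, 4, 4) < InferenceBatchDimensions(16, 0, 0)
```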
- __str__()#
Returns a string representation of the batch dimensions.
- is_applicable_for_batch_dim(real_batch_dim: core.inference.batch_dimensions_utils.InferenceBatchDimensions, strict: bool = False) → bool#
Checks whether this batch dimension is applicable for the given real batch dimension. A batch dimension is applicable when it has enough token and request budget to handle the real batch dimension.
Note that if strict is False, prefill slots can be used for either prefill or decode requests; otherwise, prefill slots can only be used for prefill requests. One plausible reading of this rule is sketched below.
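A minimal sketch of the check under that reading; the exact budget accounting is an assumption, not taken from the source:

```python
def is_applicable_for_batch_dim(self, real, strict=False):
    # The graph's token budget must cover the real batch.
    if self.token_count < real.token_count:
        return False
    if strict:
        # Strict: prefill slots serve only prefill requests.
        return (self.prefill_req_count >= real.prefill_req_count
                and self.decode_req_count >= real.decode_req_count)
    # Non-strict: spare prefill slots may also absorb decode requests.
    return (self.prefill_req_count >= real.prefill_req_count
            and self.req_count >= real.req_count)
```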
- is_valid(max_requests: int, max_sequence_length: int) → bool#
Checks if the batch dimension is valid based on resource constraints.
- Parameters:
  max_requests – Maximum number of requests allowed
  max_sequence_length – Maximum sequence length allowed
- Returns:
True if the config is valid, False otherwise
- __hash__()#
Returns a hash of the batch dimension. During CUDA graph quick matching, the batch dimension is used as a dictionary key.
- __eq__(other) → bool#
Checks if this batch dimension is equal to another batch dimension.
- property req_count: int#
Returns the total number of requests.
- static adjust_batch_dims_for_expert_parallelism(local_batch_dims, decode_only_cuda_graphs: bool)#
Adjusts CUDA graph batch dimensions for expert parallelism by taking the max token count across the expert model parallel group.
- Returns:
(InferenceBatchDimensions) A new InferenceBatchDimensions object with adjusted dimensions.
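One plausible shape for this adjustment, shown as a sketch: it assumes torch.distributed is available and that a hypothetical ep_group handle identifies the expert model parallel process group; how decode_only_cuda_graphs and the request counts factor in is not shown here:

```python
import torch
import torch.distributed as dist

def adjust_batch_dims_for_expert_parallelism(local_batch_dims, ep_group):
    # Every EP rank must replay a graph of identical size, so take the
    # maximum token count across the expert model parallel group.
    max_tokens = torch.tensor([local_batch_dims.token_count], device="cuda")
    dist.all_reduce(max_tokens, op=dist.ReduceOp.MAX, group=ep_group)
    return InferenceBatchDimensions(
        token_count=int(max_tokens.item()),
        prefill_req_count=local_batch_dims.prefill_req_count,
        decode_req_count=local_batch_dims.decode_req_count,
    )
```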
- class core.inference.batch_dimensions_utils.CUDAGraphBatchDimensionBuilder#
Builder for creating and managing CUDA graph batch dimensions.
This class provides static methods for generating lists of CUDA graph batch dimensions and matching the best batch dimension for a given real batch dimension.
- CUDA_GRAPH_ROUNDER = 8#
- static _calculate_cuda_graph_token_counts(tp_size: int, num_cuda_graphs: int, cuda_graph_max_tokens: int)#
Calculate CUDA graph token counts for a given configuration.
This method computes evenly-spaced token counts from step_size up to cuda_graph_max_tokens, ensuring proper rounding and TP alignment.
- Parameters:
tp_size – Tensor parallel size (for alignment)
num_cuda_graphs – Number of CUDA graphs to generate (must be >= 1)
cuda_graph_max_tokens – Maximum token count for CUDA graphs (must be > 0)
- Returns:
List of token counts in descending order
Example:
>>> _calculate_cuda_graph_token_counts(tp_size=2, num_cuda_graphs=4, cuda_graph_max_tokens=1000)
[1000, 752, 504, 256]
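A short runnable sketch that reproduces the example above; the rounding order (round the step down to a CUDA_GRAPH_ROUNDER multiple, then align to tp_size) is an assumption:

```python
CUDA_GRAPH_ROUNDER = 8

def calculate_cuda_graph_token_counts(tp_size, num_cuda_graphs, cuda_graph_max_tokens):
    # Evenly spaced step, rounded down to a multiple of the rounder.
    step = (cuda_graph_max_tokens // num_cuda_graphs // CUDA_GRAPH_ROUNDER) * CUDA_GRAPH_ROUNDER
    # Align the step with the tensor parallel size (assumed: round up).
    if step % tp_size:
        step += tp_size - step % tp_size
    # Token counts in descending order, starting from the maximum.
    return [cuda_graph_max_tokens - k * step for k in range(num_cuda_graphs)]

print(calculate_cuda_graph_token_counts(2, 4, 1000))  # [1000, 752, 504, 256]
```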
- static generate_cuda_graph_batch_dimensions_list(tp_size: int, num_cuda_graphs: Optional[int], cuda_graph_max_tokens: int, cuda_graph_mixed_prefill_count: Optional[int], max_requests: int, max_tokens: int, max_sequence_length: int, use_cuda_graphs_for_non_decode_steps: bool)#
Generate CUDA graph batch dimensions.
This function constructs CUDA graph batch dimensions for different token counts and request patterns, then filters them based on resource constraints.
Construction Rules:
- Token count generation: creates token counts from step_size to max_tokens, rounded to multiples of 8
- Tensor parallelism alignment: ensures step_size is divisible by the tensor parallel size
- Batch dimension creation: for each token count, creates three types of batch dimensions:
  - Decode-only: (token_count, 0, token_count) – all tokens used for decode requests
  - Mixed prefill+decode: (token_count, prefill_req_count, token_count - prefill_req_count)
  - Prefill-only: (token_count, max(prefill_req_count, ceil(token_count / (max_seq_len - 1))), 0)
Filtering Rules:
- Request limit: prefill_req_count + decode_req_count <= max_requests
- Non-negative counts: both prefill_req_count and decode_req_count must be >= 0
- Token sufficiency: token_count >= prefill_req_count + decode_req_count
Sorting Rules for Attention Metadata Construction:
- Batch dimensions are sorted by prefill token count (token_count - decode_req_count) in descending order
A sketch of the construction and filtering rules follows the return description below.
- Parameters:
tp_size – Tensor parallel size
num_cuda_graphs – Number of CUDA graphs to generate
cuda_graph_max_tokens – Maximum tokens for CUDA graphs
cuda_graph_mixed_prefill_count – Number of mixed prefill requests for CUDA graphs
max_requests – Maximum number of requests
max_tokens – Maximum total tokens
max_sequence_length – Maximum sequence length
use_cuda_graphs_for_non_decode_steps – Whether to use CUDA graphs for non-decode steps
- Returns:
  A tuple containing (1) a list of InferenceBatchDimensions objects, sorted by prefill token count in descending order, and (2) an optional list of CUDA graph token counts.
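As referenced above, a minimal sketch of the three documented batch-dimension shapes and the filtering rules for a single token count; the helper names are illustrative only:

```python
import math

def build_dims_for_token_count(token_count, prefill_req_count, max_sequence_length):
    # The three documented shapes, as (token_count, prefill_req_count, decode_req_count).
    decode_only = InferenceBatchDimensions(token_count, 0, token_count)
    mixed = InferenceBatchDimensions(
        token_count, prefill_req_count, token_count - prefill_req_count)
    prefill_only = InferenceBatchDimensions(
        token_count,
        max(prefill_req_count, math.ceil(token_count / (max_sequence_length - 1))),
        0)
    return [decode_only, mixed, prefill_only]

def passes_filters(dim, max_requests):
    # Documented filters: request limit, non-negative counts, token sufficiency.
    return (dim.prefill_req_count + dim.decode_req_count <= max_requests
            and dim.prefill_req_count >= 0
            and dim.decode_req_count >= 0
            and dim.token_count >= dim.prefill_req_count + dim.decode_req_count)
```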
- static match_graph_config(real_batch_dim: core.inference.batch_dimensions_utils.InferenceBatchDimensions, cuda_graph_batch_dimensions_list: List[core.inference.batch_dimensions_utils.InferenceBatchDimensions], strict: bool = False, decode_only_cuda_graphs: bool = False)#
Matches the best CUDA graph batch dimension for the given real batch dimension.
- Parameters:
real_batch_dim – The real batch dimension to match
cuda_graph_batch_dimensions_list – List of available CUDA graph batch dimensions
strict – If False, prefill slots can be used for prefill or decode requests. If True, prefill slots can only be used for prefill requests.
decode_only_cuda_graphs – Used by expert parallel matching. If this is true and one of the EP ranks is running a non-decode step, we elect to run in eager mode instead of matching a decode-only CUDA graph.
- Returns:
The best matching CUDA graph batch dimension, or None if no applicable match is found
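A minimal sketch of one way this match could be computed, assuming the best match is the applicable graph with the smallest budget; the tie-break order is an assumption:

```python
from typing import List, Optional

def match_graph_config(
    real_batch_dim: InferenceBatchDimensions,
    cuda_graph_batch_dimensions_list: List[InferenceBatchDimensions],
    strict: bool = False,
) -> Optional[InferenceBatchDimensions]:
    # Keep only graph dimensions whose budget covers the real batch.
    candidates = [
        dim for dim in cuda_graph_batch_dimensions_list
        if dim.is_applicable_for_batch_dim(real_batch_dim, strict=strict)
    ]
    if not candidates:
        return None  # no applicable graph; caller falls back to eager mode
    # Prefer the tightest fit to minimize wasted padding.
    return min(candidates, key=lambda d: (d.token_count, d.prefill_req_count))
```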