core.inference.batch_dimensions_utils#

Batch dimensions utilities.

This module contains utilities for managing batch dimensions, including the InferenceBatchDimensions dataclass and CUDAGraphBatchDimensionBuilder for generating and matching CUDA graph batch dimensions.

Module Contents#

Classes#

InferenceBatchDimensions

Batch dimensions for dynamic inference.

CUDAGraphBatchDimensionBuilder

Builder for creating and managing CUDA graph batch dimensions.

API#

class core.inference.batch_dimensions_utils.InferenceBatchDimensions#

Batch dimensions for dynamic inference.

Attributes:
  • token_count – Number of total input tokens

  • prefill_req_count – Number of prefill requests

  • decode_req_count – Number of decode requests

The batch dimensions are ordered by token_count, then by prefill_req_count, then by decode_req_count.
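As a rough mental model, the class behaves like an order-aware dataclass; a minimal sketch, assuming the ordering comes from field order under `dataclass(order=True)` (the real class also defines custom `__hash__` and `__eq__`):

```python
from dataclasses import dataclass

@dataclass(order=True)
class InferenceBatchDimensions:
    token_count: int = 0        # number of total input tokens
    prefill_req_count: int = 0  # number of prefill requests
    decode_req_count: int = 0   # number of decode requests
```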

token_count: int#

0

prefill_req_count: int#

0

decode_req_count: int#

0

__str__()#

Returns a string representation of the batch dimensions.

is_applicable_for_batch_dim(
real_batch_dim: core.inference.batch_dimensions_utils.InferenceBatchDimensions,
strict: bool = False,
) bool#

Checks if this batch dimension is applicable for the given real batch dimension. A batch dimension is applicable when it has enough token and request budget to handle the real batch dimension.

Note that if strict is False, prefill slots can be used for prefill or decode requests. Otherwise, prefill slots can only be used for prefill requests.
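A hedged sketch of the check implied by the description above (not the actual implementation):

```python
def is_applicable_for_batch_dim(self, real_batch_dim, strict: bool = False) -> bool:
    # Not enough total token budget: never applicable.
    if self.token_count < real_batch_dim.token_count:
        return False
    if strict:
        # Prefill slots may only serve prefill requests.
        return (self.prefill_req_count >= real_batch_dim.prefill_req_count
                and self.decode_req_count >= real_batch_dim.decode_req_count)
    # Non-strict: spare prefill slots may also absorb decode requests.
    return (self.prefill_req_count >= real_batch_dim.prefill_req_count
            and self.req_count >= real_batch_dim.req_count)
```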

is_valid(max_requests: int, max_sequence_length: int) bool#

Checks if the batch dimension is valid based on resource constraints.

Parameters:
  • max_requests – Maximum number of requests allowed

  • max_sequence_length – Maximum sequence length allowed

Returns:

True if the batch dimension is valid, False otherwise
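A plausible sketch, assuming the check mirrors the filtering rules documented below for generate_cuda_graph_batch_dimensions_list plus an assumed sequence-length bound on the prefill token budget; the real check may differ:

```python
def is_valid(self, max_requests: int, max_sequence_length: int) -> bool:
    prefill_tokens = self.token_count - self.decode_req_count
    return (
        self.prefill_req_count >= 0
        and self.decode_req_count >= 0
        and self.req_count <= max_requests          # request limit
        and self.token_count >= self.req_count      # token sufficiency
        # Assumed: prefill tokens fit into prefill slots of at most
        # max_sequence_length - 1 tokens each.
        and prefill_tokens <= self.prefill_req_count * (max_sequence_length - 1)
    )
```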

__hash__()#

Returns a hash of the batch dimension, allowing it to be used as a dictionary key during CUDA graph quick matching.

__eq__(
other: core.inference.batch_dimensions_utils.InferenceBatchDimensions,
) bool#

Checks if this batch dimension is equal to another batch dimension.
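Together with `__hash__`, this makes batch dimensions usable as dictionary keys in the CUDA graph lookup; an illustrative (hypothetical) use:

```python
graph_cache = {}  # maps batch dimensions to captured CUDA graphs

dims = InferenceBatchDimensions(token_count=256, prefill_req_count=0, decode_req_count=256)
graph_cache[dims] = "captured-graph"  # placeholder for a real graph object

# An equal key (same three counts) hits the same cache entry.
assert InferenceBatchDimensions(256, 0, 256) in graph_cache
```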

property req_count: int#

Returns the total number of requests.

static adjust_batch_dims_for_expert_parallelism(
local_batch_dims,
decode_only_cuda_graphs: bool,
) core.inference.batch_dimensions_utils.InferenceBatchDimensions#

Adjusts the CUDA graph batch dimensions for expert parallelism by taking the max token count across the expert model parallel group.

Returns:

(InferenceBatchDimensions) A new InferenceBatchDimensions object with adjusted dimensions.
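A hedged sketch of the max-reduction; `ep_group` is a hypothetical handle for the expert model parallel process group, and how the request counts are adjusted is an assumption:

```python
import torch
import torch.distributed as dist

def adjust_batch_dims_for_expert_parallelism(local_batch_dims, decode_only_cuda_graphs, ep_group):
    # Take the max token count across the expert model parallel group.
    t = torch.tensor([local_batch_dims.token_count], dtype=torch.int64, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.MAX, group=ep_group)
    max_tokens = int(t.item())
    if decode_only_cuda_graphs:
        # Assumed: decode-only graphs keep token_count == decode_req_count.
        return InferenceBatchDimensions(max_tokens, 0, max_tokens)
    return InferenceBatchDimensions(
        max_tokens, local_batch_dims.prefill_req_count, local_batch_dims.decode_req_count
    )
```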

class core.inference.batch_dimensions_utils.CUDAGraphBatchDimensionBuilder#

Builder for creating and managing CUDA graph batch dimensions.

This class provides static methods for generating lists of CUDA graph batch dimensions and matching the best batch dimension for a given real batch dimension.

CUDA_GRAPH_ROUNDER#

8

static _calculate_cuda_graph_token_counts(
tp_size: int,
num_cuda_graphs: int,
cuda_graph_max_tokens: int,
) List[int]#

Calculate CUDA graph token counts for a given configuration.

This method computes evenly spaced token counts from the step size (cuda_graph_max_tokens / num_cuda_graphs) up to cuda_graph_max_tokens, with proper rounding and TP alignment.

Parameters:
  • tp_size – Tensor parallel size (for alignment)

  • num_cuda_graphs – Number of CUDA graphs to generate (must be >= 1)

  • cuda_graph_max_tokens – Maximum token count for CUDA graphs (must be > 0)

Returns:

List of token counts in descending order

Example:

_calculate_cuda_graph_token_counts(tp_size=2, num_cuda_graphs=4, cuda_graph_max_tokens=1000)
returns [1000, 752, 504, 256]
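A minimal sketch that reproduces the documented example; the exact rounding in the real implementation may differ:

```python
import math

CUDA_GRAPH_ROUNDER = 8

def _calculate_cuda_graph_token_counts(tp_size, num_cuda_graphs, cuda_graph_max_tokens):
    step = cuda_graph_max_tokens / num_cuda_graphs
    counts = []
    for i in range(1, num_cuda_graphs + 1):
        # Round each target up to a multiple of 8 ...
        c = math.ceil(i * step / CUDA_GRAPH_ROUNDER) * CUDA_GRAPH_ROUNDER
        # ... keep it TP-aligned (a no-op when tp_size divides 8) ...
        if c % tp_size:
            c += tp_size - c % tp_size
        # ... and never exceed the configured maximum.
        counts.append(min(c, cuda_graph_max_tokens))
    return sorted(set(counts), reverse=True)

# _calculate_cuda_graph_token_counts(2, 4, 1000) -> [1000, 752, 504, 256]
```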

static generate_cuda_graph_batch_dimensions_list(
tp_size: int,
num_cuda_graphs: Optional[int],
cuda_graph_max_tokens: int,
cuda_graph_mixed_prefill_count: Optional[int],
max_requests: int,
max_tokens: int,
max_sequence_length: int,
use_cuda_graphs_for_non_decode_steps: bool,
) Tuple[List[core.inference.batch_dimensions_utils.InferenceBatchDimensions], Optional[List[int]]]#

Generate CUDA graph batch dimensions.

This function constructs CUDA graph batch dimensions for different token counts and request patterns, then filters them based on resource constraints. The construction, filtering, and sorting rules are listed below, with a combined sketch after the list:

Construction Rules:

  1. Token count generation: Creates token counts from step_size to max_tokens, rounded to multiples of 8

  2. Tensor parallelism alignment: Ensures step_size is divisible by tensor parallel size

  3. Batch dimension creation: For each token count, creates three types of batch dimensions:

    • Decode-only: (token_count, 0, token_count) - all tokens used for decode requests

    • Mixed prefill+decode: (token_count, prefill_req_count, token_count - prefill_req_count)

    • Prefill-only: (token_count, max(prefill_req_count, ceil(token_count/(max_seq_len-1))), 0)

Filtering Rules:

  1. Request limit: prefill_req_count + decode_req_count <= max_requests

  2. Non-negative counts: Both prefill_req_count and decode_req_count must be >= 0

  3. Token sufficiency: token_count >= prefill_req_count + decode_req_count

Sorting Rules for Attention Metadata Construction:

  1. Batch dimensions are sorted by prefill token count (token_count - decode_req_count) in descending order
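Taken together, the rules above might look like the following sketch (helper name and signature are hypothetical):

```python
import math

def _build_and_filter(token_counts, mixed_prefill_count, max_requests, max_seq_len):
    dims = []
    for tc in token_counts:
        dims.append(InferenceBatchDimensions(tc, 0, tc))    # decode-only
        dims.append(InferenceBatchDimensions(               # mixed prefill + decode
            tc, mixed_prefill_count, tc - mixed_prefill_count))
        dims.append(InferenceBatchDimensions(               # prefill-only
            tc, max(mixed_prefill_count, math.ceil(tc / (max_seq_len - 1))), 0))

    valid = [
        d for d in dims
        if d.prefill_req_count >= 0 and d.decode_req_count >= 0        # non-negative counts
        and d.prefill_req_count + d.decode_req_count <= max_requests   # request limit
        and d.token_count >= d.prefill_req_count + d.decode_req_count  # token sufficiency
    ]
    # Sort by prefill token count (token_count - decode_req_count), descending.
    valid.sort(key=lambda d: d.token_count - d.decode_req_count, reverse=True)
    return valid
```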

Parameters:
  • tp_size – Tensor parallel size

  • num_cuda_graphs – Number of CUDA graphs to generate

  • cuda_graph_max_tokens – Maximum tokens for CUDA graphs

  • cuda_graph_mixed_prefill_count – Number of mixed prefill requests for CUDA graphs

  • max_requests – Maximum number of requests

  • max_tokens – Maximum total tokens

  • max_sequence_length – Maximum sequence length

  • use_cuda_graphs_for_non_decode_steps – Whether to use CUDA graphs for non-decode steps

Returns:

  • List of InferenceBatchDimensions objects, sorted by prefill token count in descending order

  • Optional list of CUDA graph token counts

Return type:

Tuple[List[core.inference.batch_dimensions_utils.InferenceBatchDimensions], Optional[List[int]]]

static match_graph_config(
real_batch_dim: core.inference.batch_dimensions_utils.InferenceBatchDimensions,
cuda_graph_batch_dimensions_list: List[core.inference.batch_dimensions_utils.InferenceBatchDimensions],
strict: bool = False,
decode_only_cuda_graphs: bool = False,
) Optional[core.inference.batch_dimensions_utils.InferenceBatchDimensions]#

Matches the best CUDA graph batch dimension for the given real batch dimension.

Parameters:
  • real_batch_dim – The real batch dimension to match

  • cuda_graph_batch_dimensions_list – List of available CUDA graph batch dimensions

  • strict – If False, prefill slots can be used for prefill or decode requests. If True, prefill slots can only be used for prefill requests.

  • decode_only_cuda_graphs – Used by expert parallel matching. If this is true, and one of the EP ranks is running a non-decode step, we elect to run in eager mode instead of matching a decode-only CUDA graph.

Returns:

The best matching CUDA graph batch dimension, or None if no applicable match is found
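Illustrative (hypothetical) usage, where `graph_dims` is a previously generated list and `run_eager`/`replay_graph` stand in for the caller's execution paths:

```python
real = InferenceBatchDimensions(token_count=96, prefill_req_count=2, decode_req_count=64)

best = CUDAGraphBatchDimensionBuilder.match_graph_config(
    real_batch_dim=real,
    cuda_graph_batch_dimensions_list=graph_dims,
    strict=False,
)
if best is None:
    run_eager(real)     # no captured graph fits; fall back to eager execution
else:
    replay_graph(best)  # replay the CUDA graph captured for `best`
```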