core.inference.moe#

Package Contents#

Classes#

InferenceGroupedGemmBackend

Resolved backend for grouped GEMM operations during inference.

Functions#

resolve_inference_grouped_gemm_backend

Resolve the grouped GEMM backend to use for the current iteration.

API#

class core.inference.moe.InferenceGroupedGemmBackend(*args, **kwds)#

Bases: enum.Enum

Resolved backend for grouped GEMM operations during inference.

Initialization

FLASHINFER#

‘flashinfer’

TORCH#

‘torch’

TE#

‘te’
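The three members map one-to-one to backend identifier strings. A minimal self-contained sketch of an equivalent enum (the real class lives in `core.inference.moe`; the definition below is an illustration, not the library source):

```python
import enum


class InferenceGroupedGemmBackend(enum.Enum):
    """Resolved backend for grouped GEMM operations during inference."""

    FLASHINFER = "flashinfer"
    TORCH = "torch"
    TE = "te"


# Members can be looked up by their string value, e.g. when parsing config.
assert InferenceGroupedGemmBackend("te") is InferenceGroupedGemmBackend.TE
```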

core.inference.moe.resolve_inference_grouped_gemm_backend(
backend: str,
is_cuda_graphed: bool,
is_mxfp8: bool = False,
) → core.inference.moe.InferenceGroupedGemmBackend#

Resolve the grouped GEMM backend to use for the current iteration.

Prerequisites are validated at init time in MoELayer; this function simply maps (backend, is_cuda_graphed, is_mxfp8) to a concrete InferenceGroupedGemmBackend value.

Parameters:
  • backend – One of ‘auto’, ‘torch’, ‘te’.

  • is_cuda_graphed – Whether this is a CUDA-graphed iteration.

  • is_mxfp8 – Whether the model is using MXFP8 quantization (affects auto backend choice).

Returns:

An InferenceGroupedGemmBackend enum value.
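The documented contract can be sketched as a pure mapping function. Note that the dispatch policy below, in particular how `'auto'` resolves and when CUDA graphing or MXFP8 changes the choice, is an assumption for illustration only; the page documents the interface, not the actual rules:

```python
import enum


class InferenceGroupedGemmBackend(enum.Enum):
    FLASHINFER = "flashinfer"
    TORCH = "torch"
    TE = "te"


def resolve_inference_grouped_gemm_backend(
    backend: str,
    is_cuda_graphed: bool,
    is_mxfp8: bool = False,
) -> InferenceGroupedGemmBackend:
    """Map (backend, is_cuda_graphed, is_mxfp8) to a concrete enum value.

    Prerequisite checks (library availability, dtype support) are assumed
    to have passed at MoELayer init time, so no validation happens here.
    The 'auto' branch is a hypothetical policy, not the library's real one.
    """
    if backend == "torch":
        return InferenceGroupedGemmBackend.TORCH
    if backend == "te":
        return InferenceGroupedGemmBackend.TE
    if backend == "auto":
        # Hypothetical policy: MXFP8 quantization and CUDA-graphed
        # iterations fall back to TE; otherwise prefer FlashInfer.
        if is_mxfp8 or is_cuda_graphed:
            return InferenceGroupedGemmBackend.TE
        return InferenceGroupedGemmBackend.FLASHINFER
    raise ValueError(f"unknown backend: {backend!r}")
```

Because validation is done once at layer construction, the per-iteration call stays a cheap branch that can safely run inside the decode loop.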