core.inference.moe#
Submodules#
Package Contents#
Classes#
InferenceGroupedGemmBackend: Resolved backend for grouped GEMM operations during inference.
Functions#
resolve_inference_grouped_gemm_backend: Resolve the grouped GEMM backend to use for the current iteration.
API#
- class core.inference.moe.InferenceGroupedGemmBackend(*args, **kwds)#
Bases: enum.Enum

Resolved backend for grouped GEMM operations during inference.
Initialization
- FLASHINFER#
‘flashinfer’
- TORCH#
‘torch’
- TE#
‘te’
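The members above can be reconstructed as a plain `enum.Enum`. This is a minimal sketch based solely on the documented member names and values, not the library's actual source:

```python
import enum


class InferenceGroupedGemmBackend(enum.Enum):
    """Resolved backend for grouped GEMM operations during inference.

    Sketch of the documented enum; values match the member strings
    listed above ('flashinfer', 'torch', 'te').
    """

    FLASHINFER = "flashinfer"
    TORCH = "torch"
    TE = "te"


# Members can be looked up by value and compared by identity:
backend = InferenceGroupedGemmBackend("te")
assert backend is InferenceGroupedGemmBackend.TE
```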
- core.inference.moe.resolve_inference_grouped_gemm_backend(backend: str, is_cuda_graphed: bool, is_mxfp8: bool = False)#
Resolve the grouped GEMM backend to use for the current iteration.
Prerequisites are validated at init time in MoELayer; this function simply maps (backend, is_cuda_graphed) to the concrete backend enum.
- Parameters:
backend – One of ‘auto’, ‘torch’, ‘te’.
is_cuda_graphed – Whether this is a CUDA-graphed iteration.
is_mxfp8 – Whether the model is using MXFP8 quantization (affects auto backend choice).
- Returns:
An InferenceGroupedGemmBackend enum value.
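Since prerequisites are validated earlier in MoELayer, the resolver is a pure mapping from its arguments to an enum member. The sketch below illustrates that shape; the `'auto'` policy shown (MXFP8 prefers TE, CUDA-graphed iterations prefer FlashInfer, otherwise torch) is an assumption for illustration only, not the library's actual selection logic:

```python
import enum


class InferenceGroupedGemmBackend(enum.Enum):
    FLASHINFER = "flashinfer"
    TORCH = "torch"
    TE = "te"


def resolve_inference_grouped_gemm_backend(
    backend: str,
    is_cuda_graphed: bool,
    is_mxfp8: bool = False,
) -> InferenceGroupedGemmBackend:
    """Map (backend, is_cuda_graphed, is_mxfp8) to a concrete backend.

    Hypothetical policy: only the explicit 'torch'/'te' branches follow
    directly from the documentation; the 'auto' branch is an assumed
    example of how the flags could influence the choice.
    """
    if backend == "torch":
        return InferenceGroupedGemmBackend.TORCH
    if backend == "te":
        return InferenceGroupedGemmBackend.TE
    if backend != "auto":
        raise ValueError(f"unknown grouped GEMM backend: {backend!r}")
    # Assumed 'auto' policy for illustration:
    if is_mxfp8:
        return InferenceGroupedGemmBackend.TE
    if is_cuda_graphed:
        return InferenceGroupedGemmBackend.FLASHINFER
    return InferenceGroupedGemmBackend.TORCH
```

Because the function is a pure mapping, it can safely be called once per iteration with no hidden state.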