core.inference.contexts.static_context#

Module Contents#

Classes#

StaticInferenceContext

Static inference context that is passed to the main model in order to efficiently manage the KV cache during inference.

API#

class core.inference.contexts.static_context.StaticInferenceContext(
max_batch_size: int,
max_sequence_length: int,
use_flashinfer_fused_rope: Optional[bool] = None,
)#

Bases: core.inference.contexts.base_context.BaseInferenceContext

Static inference context that is passed to the main model in order to efficiently manage the KV cache during inference.

Parameters:
  • max_batch_size (int) – Max supported batch size.

  • max_sequence_length (int) – Max supported sequence length.

Initialization

Parameters:
  • materialize_only_last_token_logits (bool) – If True, only the last-token logits will be extracted during decode.
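
A minimal construction sketch, assuming the class is importable from the packaged `megatron.core` namespace; the sizes are illustrative:

```python
from megatron.core.inference.contexts.static_context import StaticInferenceContext

# Size the context for the largest batch and sequence length we expect to serve;
# the KV cache managed through this context is bounded by these limits.
context = StaticInferenceContext(
    max_batch_size=8,
    max_sequence_length=2048,
)
```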

classmethod from_config(
config: megatron.core.inference.model_inference_wrappers.inference_wrapper_config.InferenceWrapperConfig,
) → core.inference.contexts.static_context.StaticInferenceContext#

Initialize context from a config.
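
A hedged usage sketch: the context is built from the same `InferenceWrapperConfig` used to set up the model's inference wrapper. How that config is populated is omitted here, since its fields vary across versions:

```python
from megatron.core.inference.contexts.static_context import StaticInferenceContext
from megatron.core.inference.model_inference_wrappers.inference_wrapper_config import (
    InferenceWrapperConfig,
)


def build_context(config: InferenceWrapperConfig) -> StaticInferenceContext:
    # Derive the context from the wrapper config so its limits stay consistent
    # with the inference wrapper it accompanies.
    return StaticInferenceContext.from_config(config)
```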

swap_key_value_dict(batch_idx)#

Swap the KV cache between batches, as indexed by batch_idx.
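
A hypothetical usage sketch, assuming `batch_idx` is a per-sequence index (for example, a reordering produced by beam search) applied along the batch dimension of the cached keys and values:

```python
import torch

from megatron.core.inference.contexts.static_context import StaticInferenceContext


def reorder_kv_cache(context: StaticInferenceContext, new_order: torch.Tensor) -> None:
    # Keep the cached keys/values aligned with the sequences after the batch
    # has been reordered (one index per sequence currently in the batch).
    context.swap_key_value_dict(new_order)
```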

enable_prefill_mode()#

Indicates the generation loop is in the prefill phase (still processing input prompt tokens). This should be enabled if the generation loop is encoding prompt tokens for any request in a batch.

enable_decode_mode()#

Indicates the generation loop is in the decode phase (generating new output tokens). This should only be enabled if the generation loop has fully encoded the prompts for all requests in a batch.

is_decode_only()#

Functional access to .decode_mode, mirroring the dynamic context's interface.

reset()#

Resets the inference state for a new batch.
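
A sketch of how these phase transitions fit around a generation loop; the model forward passes are elided and the token budget is illustrative:

```python
from megatron.core.inference.contexts.static_context import StaticInferenceContext

NUM_NEW_TOKENS = 16  # illustrative decode budget

context = StaticInferenceContext(max_batch_size=4, max_sequence_length=1024)

# Prefill: prompt tokens for at least one request are still being encoded.
context.enable_prefill_mode()
# ... run the model over the prompt tokens, passing `context` ...

# Decode: every request's prompt has been fully encoded; generate new tokens.
context.enable_decode_mode()
for _ in range(NUM_NEW_TOKENS):
    assert context.is_decode_only()
    # ... run the model on the most recently generated tokens, passing `context` ...

# Clear the cached inference state before starting the next batch.
context.reset()
```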

__str__()#

__eq__(other)#

is_static_batching()#