core.inference.engines.static_engine#

Module Contents#

Classes#

StaticInferenceEngine

The Megatron Core static inference backend

API#

class core.inference.engines.static_engine.StaticInferenceEngine(
text_generation_controller: megatron.core.inference.text_generation_controllers.text_generation_controller.TextGenerationController,
max_batch_size: Optional[int] = None,
random_seed: Optional[int] = None,
legacy=False,
buffer_size_gb: Optional[float] = 40,
)#

Bases: megatron.core.inference.engines.abstract_engine.AbstractEngine

The Megatron Core static inference backend

This is the backend that performs a simple forward pass on the model. It supports any callable model that accepts the inputs and returns the output tensor.

Parameters:
  • text_generation_controller (TextGenerationController) – A text generation controller that defines how to preprocess prompts, generate outputs, and detokenize the output tokens.

  • max_batch_size (int, optional) – The maximum number of requests to process at once. Will be set from the InferenceWrapperConfig in text_generation_controller by default.

  • random_seed (int, optional) – Seed for the random number generator; set it to obtain deterministic results. Defaults to None.

Initialization
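
A minimal construction sketch, not taken from the source docs: it assumes `inference_wrapped_model` (e.g. a GPT inference wrapper) and `tokenizer` have already been built elsewhere, and that `TextGenerationController` accepts them as shown.

```python
# Minimal sketch, assuming `inference_wrapped_model` and `tokenizer` are
# already-constructed objects from the surrounding Megatron-Core setup.
from megatron.core.inference.engines.static_engine import StaticInferenceEngine
from megatron.core.inference.text_generation_controllers.text_generation_controller import (
    TextGenerationController,
)

controller = TextGenerationController(
    inference_wrapped_model=inference_wrapped_model,
    tokenizer=tokenizer,
)

engine = StaticInferenceEngine(
    text_generation_controller=controller,
    max_batch_size=8,   # optional; otherwise taken from the InferenceWrapperConfig
    random_seed=1234,   # optional; set for deterministic sampling
)
```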

get_new_request_id() str#

Gets a new request ID from the scheduler.

add_request(
prompt: Optional[str] = None,
add_BOS: bool = False,
encoder_prompt: Optional[str] = None,
sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
streaming: bool = False,
inference_request: Optional[megatron.core.inference.inference_request.InferenceRequest] = None,
*,
inference_parameters: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
) int#

Adds a request to the scheduler and returns the request ID.

Parameters:
  • prompt (str) – A prompt string

  • add_BOS (bool) – Whether to add a BOS token to the beginning of the prompt

  • encoder_prompt (str) – The encoder prompt string

  • sampling_params (SamplingParams) – The sampling parameters for this request

  • streaming (bool) – Whether to stream incremental outputs for this request

  • inference_request (InferenceRequest, optional) – A fully constructed request. Defaults to None.

  • inference_parameters (SamplingParams, optional) – Deprecated; use sampling_params instead.

Returns:

The newly created request ID.
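
A hedged usage sketch: queue one prompt and then drain the scheduler with run_engine(). The SamplingParams field names used here (temperature, num_tokens_to_generate) are assumptions about that class, not documented on this page.

```python
# Sketch: add a single request, then run the engine until the queue is empty.
# `engine` is a constructed StaticInferenceEngine; the SamplingParams fields
# shown are assumptions.
from megatron.core.inference.sampling_params import SamplingParams

request_id = engine.add_request(
    prompt="The capital of France is",
    sampling_params=SamplingParams(temperature=1.0, num_tokens_to_generate=32),
)
engine.run_engine()  # processes queued requests until the scheduler is empty
```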

get_stream_generator(
request_id: int,
) Union[AsyncGenerator[megatron.core.inference.inference_request.InferenceRequest, None], None]#

Returns the stream generator for the given request ID if it exists.
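
A sketch of consuming the stream, assuming the request was added with streaming=True and the engine is being driven concurrently (for example via run_engine_async); the generated_text attribute is an assumption about InferenceRequest.

```python
# Sketch: iterate over incremental InferenceRequest snapshots for one request.
# Assumes the request was added with streaming=True and the engine is running
# in the same event loop (e.g. via run_engine_async).
async def consume_stream(engine, request_id: int) -> None:
    stream = engine.get_stream_generator(request_id)
    if stream is None:  # no stream registered for this request ID
        return
    async for partial_request in stream:
        # `generated_text` is assumed to hold the text produced so far.
        print(partial_request.generated_text)
```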

generate_using_dynamic_engine(
prompts: Optional[List[str]] = None,
add_BOS: bool = False,
encoder_prompts: Optional[List[str]] = None,
common_inference_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
inference_requests: Optional[List[megatron.core.inference.inference_request.InferenceRequest]] = None,
) List[megatron.core.inference.inference_request.InferenceRequest]#

Generate using the dynamic inference engine.

Parameters:
  • prompts (List[str]) – All the prompts as a list of strings

  • add_BOS (bool) – Whether to add a BOS token to the beginning of the prompts

  • encoder_prompts (List[str]) – All the encoder prompts as a list of strings

  • common_inference_params (SamplingParams, optional) – Deprecated. Only used for backward compatibility with MCore <= 0.9.0. Use sampling_params going forward.

  • sampling_params (SamplingParams) – The request-level sampling parameters

  • inference_requests (List[InferenceRequest]) – A pre-populated list of inference requests

Returns:

A list of inference requests containing the generated tokens, texts, and log probs (if requested).

Return type:

List[InferenceRequest]
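
A hedged sketch of a batched call on the dynamic path; it assumes the engine was constructed with legacy=False and that SamplingParams exposes num_tokens_to_generate and return_log_probs.

```python
# Sketch: batched generation via the dynamic path, requesting log probs.
# Assumes the engine was built with legacy=False; the SamplingParams fields
# and the `generated_text` attribute are assumptions.
from megatron.core.inference.sampling_params import SamplingParams

requests = engine.generate_using_dynamic_engine(
    prompts=["Megatron-Core is", "The speed of light is"],
    sampling_params=SamplingParams(num_tokens_to_generate=16, return_log_probs=True),
)
for request in requests:
    print(request.generated_text)
```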

generate_using_legacy_static_engine(
prompts: Optional[List[str]] = None,
add_BOS: bool = False,
encoder_prompts: Optional[List[str]] = None,
common_inference_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
inference_requests: Optional[List[megatron.core.inference.inference_request.InferenceRequest]] = None,
) List[megatron.core.inference.inference_request.InferenceRequest]#

The Megatron Core legacy static-engine generate function

This backend returns the output generations as a list.

Parameters:
  • prompts (List[str]) – All the prompts as a list of strings

  • add_BOS (bool) – Whether to add a BOS token to the beginning of the prompts

  • encoder_prompts (List[str]) – All the encoder prompts as a list of strings

  • common_inference_params (SamplingParams, optional) – Deprecated. Only used for backward compatibility with MCore <= 0.9.0. Use sampling_params going forward.

  • sampling_params (SamplingParams) – The request-level sampling parameters

  • inference_requests (List[InferenceRequest]) – A pre-populated list of inference requests

Returns:

A list of inference requests containing the generated tokens, texts, and log probs (if requested).

Return type:

List[InferenceRequest]

generate(
prompts: Optional[List[str]] = None,
add_BOS: bool = False,
encoder_prompts: Optional[List[str]] = None,
common_inference_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
inference_requests: Optional[List[megatron.core.inference.inference_request.InferenceRequest]] = None,
) List[megatron.core.inference.inference_request.InferenceRequest]#

The Megatron Core inference backend generate function

Uses the dynamic engine if available; otherwise falls back to the legacy static engine.

Parameters:
  • prompts (List[str]) – All the prompts as a list of strings

  • add_BOS (bool) – Whether to add a BOS token to the beginning of the prompts

  • encoder_prompts (List[str]) – All the encoder prompts as a list of strings

  • common_inference_params (SamplingParams, optional) – Deprecated. Only used for backward compatibility with MCore <= 0.9.0. Use sampling_params going forward.

  • sampling_params (SamplingParams) – The request-level sampling parameters

  • inference_requests (List[InferenceRequest]) – A pre-populated list of inference requests

Returns:

A list of inference requests containing the generated tokens, texts, and log probs (if requested).

Return type:

List[InferenceRequest]
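
A minimal end-to-end sketch using generate(), which picks the dynamic or legacy path automatically; the SamplingParams fields shown (temperature, top_k, top_p) are assumptions about that class.

```python
# Sketch: the usual entry point -- generate() dispatches to the dynamic engine
# if available and otherwise falls back to the legacy static engine.
from megatron.core.inference.sampling_params import SamplingParams

results = engine.generate(
    prompts=["Write a haiku about GPUs."],
    sampling_params=SamplingParams(temperature=0.8, top_k=50, top_p=0.95),
)
print(results[0].generated_text)  # `generated_text` is an assumed attribute
```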

run_engine()#

Main entry point for running inference.

Runs the engine until there are no requests in the queue.

Parameters:

dynamic_generation (bool, optional) – Set this to True to enable dynamic batching. Mainly used with an inference server. Defaults to False.

_wrapped_run_engine(cuda_device)#

Explicitly sets the CUDA device before running the engine.

This is to ensure that the CUDA device is correctly propagated when running in a new thread context.
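
A sketch of the new-thread scenario this helper targets; _wrapped_run_engine is internal, so this only illustrates the device-pinning intent rather than a documented usage pattern.

```python
# Sketch: queue requests first (e.g. via engine.add_request(...)), then drain
# them in a background thread that is explicitly pinned to the current CUDA
# device so the device setting propagates into the new thread context.
import threading

import torch

device = torch.cuda.current_device()
worker = threading.Thread(target=engine._wrapped_run_engine, args=(device,))
worker.start()
worker.join()
```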

async run_engine_async(
loop: Optional[asyncio.AbstractEventLoop] = None,
)#

Runs the engine asynchronously using asyncio
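
A hedged sketch of driving the engine with asyncio, assuming streaming requests are added before or while the engine task runs so their stream generators can be consumed concurrently.

```python
# Sketch: run the engine as an asyncio task alongside streaming consumers.
import asyncio

async def main() -> None:
    loop = asyncio.get_running_loop()
    engine_task = asyncio.create_task(engine.run_engine_async(loop=loop))
    # ... add streaming requests and consume their stream generators here ...
    await engine_task

asyncio.run(main())
```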