core.inference.engines.static_engine#

Module Contents#

Classes#

StaticInferenceEngine

The Megatron Core static inference backend

API#

class core.inference.engines.static_engine.StaticInferenceEngine(
text_generation_controller: megatron.core.inference.text_generation_controllers.text_generation_controller.TextGenerationController,
max_batch_size: Optional[int] = None,
random_seed: Optional[int] = None,
legacy=False,
buffer_size_gb: Optional[float] = 40,
)#

Bases: megatron.core.inference.engines.abstract_engine.AbstractEngine

The Megatron Core static inference backend

This is the backend that performs a simple forward pass on the model. It supports any callable model that accepts the inputs and returns the output tensor.

Parameters:
  • text_generation_controller (TextGenerationController) – A text generation controller that defines how to preprocess prompts, generate outputs, and detokenize the output tokens.

  • max_batch_size (int, optional) – The maximum number of requests to process at once. Will be set from the InferenceWrapperConfig in text_generation_controller by default.

  • random_seed (int, optional) – Seed for the random number generator; set it to obtain deterministic results. Defaults to None.

Initialization
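
A minimal construction sketch, not taken from the source docs: it assumes `inference_wrapped_model` (e.g. a GPT inference wrapper) and `tokenizer` have already been built elsewhere, and that `TextGenerationController` accepts them as shown.

```python
# Minimal sketch, assuming `inference_wrapped_model` and `tokenizer` are
# already-constructed objects from the surrounding Megatron-Core setup.
from megatron.core.inference.engines.static_engine import StaticInferenceEngine
from megatron.core.inference.text_generation_controllers.text_generation_controller import (
    TextGenerationController,
)

controller = TextGenerationController(
    inference_wrapped_model=inference_wrapped_model,
    tokenizer=tokenizer,
)

engine = StaticInferenceEngine(
    text_generation_controller=controller,
    max_batch_size=8,   # optional; otherwise taken from the InferenceWrapperConfig
    random_seed=1234,   # optional; set for deterministic sampling
)
```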

get_new_request_id() str#

Gets a new request ID from the scheduler.

add_request(
prompt: Optional[str] = None,
add_BOS: bool = False,
encoder_prompt: Optional[str] = None,
sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
streaming: bool = False,
inference_request: Optional[megatron.core.inference.inference_request.InferenceRequest] = None,
*,
inference_parameters: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
) int#

Adds a request to the scheduler and returns the request ID.

Parameters:
  • prompt (str) – A prompt string

  • add_BOS (bool) – Whether to add a BOS token to the beginning of the prompt

  • encoder_prompt (str) – The encoder prompt string

  • sampling_params (SamplingParams) – The sampling parameters for this request

  • streaming (bool) – Whether to stream incremental outputs for this request

  • inference_request (InferenceRequest, optional) – A fully constructed request. Defaults to None.

  • inference_parameters (SamplingParams, optional) – Deprecated; use sampling_params instead.

Returns:

The newly created request ID.
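
A hedged usage sketch: queue one prompt and then drain the scheduler with run_engine(). The SamplingParams field names used here (temperature, num_tokens_to_generate) are assumptions about that class, not documented on this page.

```python
# Sketch: add a single request, then run the engine until the queue is empty.
# `engine` is a constructed StaticInferenceEngine; the SamplingParams fields
# shown are assumptions.
from megatron.core.inference.sampling_params import SamplingParams

request_id = engine.add_request(
    prompt="The capital of France is",
    sampling_params=SamplingParams(temperature=1.0, num_tokens_to_generate=32),
)
engine.run_engine()  # processes queued requests until the scheduler is empty
```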

get_stream_generator(
request_id: int,
) Union[AsyncGenerator[megatron.core.inference.inference_request.InferenceRequest, None], None]#

Returns the stream generator for the given request ID if it exists.
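
A sketch of consuming the stream, assuming the request was added with streaming=True and the engine is being driven concurrently (for example via run_engine_async); the generated_text attribute is an assumption about InferenceRequest.

```python
# Sketch: iterate over incremental InferenceRequest snapshots for one request.
# Assumes the request was added with streaming=True and the engine is running
# in the same event loop (e.g. via run_engine_async).
async def consume_stream(engine, request_id: int) -> None:
    stream = engine.get_stream_generator(request_id)
    if stream is None:  # no stream registered for this request ID
        return
    async for partial_request in stream:
        # `generated_text` is assumed to hold the text produced so far.
        print(partial_request.generated_text)
```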

generate_using_dynamic_engine(
prompts: Optional[List[str]] = None,
add_BOS: bool = False,
encoder_prompts: Optional[List[str]] = None,
common_inference_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
inference_requests: Optional[List[megatron.core.inference.inference_request.InferenceRequest]] = None,
) List[megatron.core.inference.inference_request.InferenceRequest]#

Generate using the dynamic inference engine.

Parameters:
  • prompts (List[str]) – All the prompts as a list of strings

  • add_BOS (bool) – Whether to add a BOS token to the beginning of the prompts

  • encoder_prompts (List[str]) – All the encoder prompts as a list of strings

  • common_inference_params (SamplingParams, optional) – Deprecated. Only used for backward compatibility with MCore <= 0.9.0. Use sampling_params going forward.

  • sampling_params (SamplingParams) – The request-level sampling parameters

  • inference_requests (List[InferenceRequest]) – A pre-populated list of inference requests

Returns:

A list of inference requests containing the generated tokens, texts, and log probs (if requested).

Return type:

List[InferenceRequest]
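
A hedged sketch of a batched call on the dynamic path; it assumes the engine was constructed with legacy=False and that SamplingParams exposes num_tokens_to_generate and return_log_probs.

```python
# Sketch: batched generation via the dynamic path, requesting log probs.
# Assumes the engine was built with legacy=False; the SamplingParams fields
# and the `generated_text` attribute are assumptions.
from megatron.core.inference.sampling_params import SamplingParams

requests = engine.generate_using_dynamic_engine(
    prompts=["Megatron-Core is", "The speed of light is"],
    sampling_params=SamplingParams(num_tokens_to_generate=16, return_log_probs=True),
)
for request in requests:
    print(request.generated_text)
```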

generate_using_legacy_static_engine(
prompts: Optional[List[str]] = None,
add_BOS: bool = False,
encoder_prompts: Optional[List[str]] = None,
common_inference_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
inference_requests: Optional[List[megatron.core.inference.inference_request.InferenceRequest]] = None,
) List[megatron.core.inference.inference_request.InferenceRequest]#

The Megatron Core legacy static-engine generate function

This backend returns the output generations as a list.

Parameters:
  • prompts (List[str]) – All the prompts as a list of strings

  • add_BOS (bool) – Whether to add a BOS token to the beginning of the prompts

  • encoder_prompts (List[str]) – All the encoder prompts as a list of strings

  • common_inference_params (SamplingParams, optional) – Deprecated. Only used for backward compatibility with MCore <= 0.9.0. Use sampling_params going forward.

  • sampling_params (SamplingParams) – The request-level sampling parameters

  • inference_requests (List[InferenceRequest]) – A pre-populated list of inference requests

Returns:

A list of inference requests containing the generated tokens, texts, and log probs (if requested).

Return type:

List[InferenceRequest]

generate(
prompts: Optional[List[str]] = None,
add_BOS: bool = False,
encoder_prompts: Optional[List[str]] = None,
common_inference_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
inference_requests: Optional[List[megatron.core.inference.inference_request.InferenceRequest]] = None,
) List[megatron.core.inference.inference_request.InferenceRequest]#

The Megatron Core inference backend generate function

Uses the dynamic engine if available; otherwise falls back to the legacy static engine.

Parameters:
  • prompts (List[str]) – All the prompts as a list of strings

  • add_BOS (bool) – Whether to add a BOS token to the beginning of the prompts

  • encoder_prompts (List[str]) – All the encoder prompts as a list of strings

  • common_inference_params (SamplingParams, optional) – Deprecated. Only used for backward compatibility with MCore <= 0.9.0. Use sampling_params going forward.

  • sampling_params (SamplingParams) – The request-level sampling parameters

  • inference_requests (List[InferenceRequest]) – A pre-populated list of inference requests

Returns:

A list of inference requests containing the generated tokens, texts, and log probs (if requested).

Return type:

List[InferenceRequest]
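
A minimal end-to-end sketch using generate(), which picks the dynamic or legacy path automatically; the SamplingParams fields shown (temperature, top_k, top_p) are assumptions about that class.

```python
# Sketch: the usual entry point -- generate() dispatches to the dynamic engine
# if available and otherwise falls back to the legacy static engine.
from megatron.core.inference.sampling_params import SamplingParams

results = engine.generate(
    prompts=["Write a haiku about GPUs."],
    sampling_params=SamplingParams(temperature=0.8, top_k=50, top_p=0.95),
)
print(results[0].generated_text)  # `generated_text` is an assumed attribute
```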

run_engine()#

Main entry point for running inference.

Runs the engine until there are no requests in the queue.

Parameters:

dynamic_generation (bool, optional) – Set this to True to enable dynamic batching. Mainly used with an inference server. Defaults to False.

_wrapped_run_engine(cuda_device)#

Explicitly sets the CUDA device before running the engine.

This is to ensure that the CUDA device is correctly propagated when running in a new thread context.
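
A sketch of the new-thread scenario this helper targets; _wrapped_run_engine is internal, so this only illustrates the device-pinning intent rather than a documented usage pattern.

```python
# Sketch: queue requests first (e.g. via engine.add_request(...)), then drain
# them in a background thread that is explicitly pinned to the current CUDA
# device so the device setting propagates into the new thread context.
import threading

import torch

device = torch.cuda.current_device()
worker = threading.Thread(target=engine._wrapped_run_engine, args=(device,))
worker.start()
worker.join()
```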

async run_engine_async(
loop: Optional[asyncio.AbstractEventLoop] = None,
)#

Runs the engine asynchronously using asyncio
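
A hedged sketch of driving the engine with asyncio, assuming streaming requests are added before or while the engine task runs so their stream generators can be consumed concurrently.

```python
# Sketch: run the engine as an asyncio task alongside streaming consumers.
import asyncio

async def main() -> None:
    loop = asyncio.get_running_loop()
    engine_task = asyncio.create_task(engine.run_engine_async(loop=loop))
    # ... add streaming requests and consume their stream generators here ...
    await engine_task

asyncio.run(main())
```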