core.inference.engines.static_engine#
Module Contents#
Classes#
- StaticInferenceEngine – The Megatron core backend constructor
API#
- class core.inference.engines.static_engine.StaticInferenceEngine(
- text_generation_controller: megatron.core.inference.text_generation_controllers.text_generation_controller.TextGenerationController,
- max_batch_size: Optional[int] = None,
- random_seed: Optional[int] = None,
- legacy=False,
- buffer_size_gb: Optional[float] = 40,
- )
Bases:
megatron.core.inference.engines.abstract_engine.AbstractEngine
The Megatron core backend constructor
This is the backend that does a simple forward pass on the model. It supports any callable model (i.e., one that accepts the inputs and returns the output tensor).
- Parameters:
text_generation_controller (TextGenerationController) – A text generation controller that will be used to define how to preprocess prompts, generate outputs and detokenize the output tokens.
max_batch_size (int, optional) – The maximum number of requests to process at once. Will be set from the InferenceWrapperConfig in text_generation_controller by default.
random_seed (int, optional) – Use a random seed if you want deterministic results. Defaults to None.
Initialization
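A minimal construction sketch, assuming you already have a `TextGenerationController` built from your inference-wrapped model and tokenizer (the controller setup itself is outside this module and not shown):

```python
from megatron.core.inference.engines.static_engine import StaticInferenceEngine

# `controller` is assumed to be a pre-built TextGenerationController.
engine = StaticInferenceEngine(
    text_generation_controller=controller,
    max_batch_size=8,    # optional; taken from the InferenceWrapperConfig if omitted
    random_seed=1234,    # optional; set for reproducible sampling
    buffer_size_gb=40,   # optional; working buffer budget in GB
)
```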
- get_new_request_id() str#
Gets a new request ID from the scheduler
- add_request(
- prompt: Optional[str] = None,
- add_BOS: bool = False,
- encoder_prompt: Optional[str] = None,
- sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
- streaming: bool = False,
- inference_request: Optional[megatron.core.inference.inference_request.InferenceRequest] = None,
- *,
- inference_parameters: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
- )
Adds a request to the scheduler and returns the request ID.
- Parameters:
prompt (str) – A prompt string
add_BOS (bool) – Whether to add BOS token to beginning of the prompt
encoder_prompt (str) – The encoder prompt string
sampling_params (SamplingParams) – The inference parameters
streaming (bool) – Whether to stream incremental outputs for this request
inference_request (InferenceRequest, optional) – A fully constructed request. Defaults to None.
inference_parameters (SamplingParams, optional) – Deprecated; renamed to sampling_params.
- Returns:
The newly created request ID.
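A hedged usage sketch, assuming `engine` is an already-constructed `StaticInferenceEngine`; the `SamplingParams` fields shown are illustrative and may vary by version:

```python
from megatron.core.inference.sampling_params import SamplingParams

request_id = engine.add_request(
    prompt="Explain pipeline parallelism in one sentence.",
    add_BOS=True,
    sampling_params=SamplingParams(temperature=1.0, num_tokens_to_generate=64),
    streaming=False,
)
```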
- get_stream_generator(
- request_id: int,
- )
Returns the stream generator for the given request ID if it exists.
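A streaming lookup sketch, assuming `engine` is a `StaticInferenceEngine`; how the returned stream is consumed depends on the controller and is typically paired with `run_engine_async`, shown further below:

```python
request_id = engine.add_request(
    prompt="Write a haiku about GPUs.",
    streaming=True,
)
# Returns the stream registered for this request, if any.
stream = engine.get_stream_generator(request_id)
```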
- generate_using_dynamic_engine(
- prompts: Optional[List[str]] = None,
- add_BOS: bool = False,
- encoder_prompts: Optional[List[str]] = None,
- common_inference_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
- sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
- inference_requests: Optional[List[megatron.core.inference.inference_request.InferenceRequest]] = None,
- )
Generate using the dynamic engine.
- Parameters:
prompts (List[str]) – All the prompts as a list of strings
add_BOS (bool) – Whether to add BOS token to beginning of prompts
encoder_prompts (List[str]) – All the encoder prompts as a list of strings
common_inference_params – Deprecated. Only used for backward compatibility with MCore <= 0.9.0. Use sampling_params going forward.
sampling_params (SamplingParams) – The request-level sampling parameters
inference_requests (List[InferenceRequest]) – A pre-populated list of inference requests
- Returns:
The output is a list of inference requests containing the generated tokens, texts, and log probs if required
- Return type:
List[InferenceRequest]
- generate_using_legacy_static_engine(
- prompts: Optional[List[str]] = None,
- add_BOS: bool = False,
- encoder_prompts: Optional[List[str]] = None,
- common_inference_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
- sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
- inference_requests: Optional[List[megatron.core.inference.inference_request.InferenceRequest]] = None,
- )
The Megatron core inference backend generate function
This backend returns the output generations as a list.
- Parameters:
prompts (List[str]) – All the prompts as a list of strings
add_BOS (bool) – Whether to add BOS token to beginning of prompts
encoder_prompts (List[str]) – All the encoder prompts as a list of strings
common_inference_params – Deprecated. Only used for backward compatibility with MCore <= 0.9.0. Use sampling_params going forward.
sampling_params (SamplingParams) – The request-level sampling parameters
inference_requests (List[InferenceRequest]) – A pre-populated list of inference requests
- Returns:
The output is a list of inference requests containing the generated tokens, texts, and log probs if required
- Return type:
List[InferenceRequest]
- generate(
- prompts: Optional[List[str]] = None,
- add_BOS: bool = False,
- encoder_prompts: Optional[List[str]] = None,
- common_inference_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
- sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
- inference_requests: Optional[List[megatron.core.inference.inference_request.InferenceRequest]] = None,
- )
The Megatron core inference backend generate function
This uses the dynamic engine if available; otherwise it falls back to the legacy static engine.
- Parameters:
prompts (List[str]) – All the prompts as a list of strings
add_BOS (bool) – Whether to add BOS token to beginning of prompts
encoder_prompts (List[str]) – All the encoder prompts as a list of strings
common_inference_params – Deprecated. Only used for backward compatibility with MCore <= 0.9.0. Use sampling_params going forward.
sampling_params (SamplingParams) – The request-level sampling parameters
inference_requests (List[InferenceRequest]) – A pre-populated list of inference requests
- Returns:
The output is a list of inference requests containing the generated tokens, texts, and log probs if required
- Return type:
List[InferenceRequest]
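A batch-generation sketch, assuming `engine` is a `StaticInferenceEngine`; the `InferenceRequest` attribute used to read results (`generated_text`) is illustrative and may differ across versions:

```python
from megatron.core.inference.sampling_params import SamplingParams

results = engine.generate(
    prompts=["Hello, world!", "What is tensor parallelism?"],
    add_BOS=True,
    sampling_params=SamplingParams(temperature=0.8, top_p=0.95, return_log_probs=True),
)
for request in results:
    # Attribute name is an assumption for illustration.
    print(request.generated_text)
```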
- run_engine()#
Main functionality to run inference
Runs the engine until there are no requests in the queue.
- Parameters:
dynamic_generation (bool, optional) – Set this to True, if you want to enable dynamic batching. Mainly used with an inference server. Defaults to False.
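A queue-draining sketch, assuming `engine` is a `StaticInferenceEngine` and requests have already been queued via `add_request`:

```python
prompts = ["First prompt", "Second prompt"]
request_ids = [engine.add_request(prompt=p) for p in prompts]

# Blocks until the scheduler queue is empty.
engine.run_engine()
```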
- _wrapped_run_engine(cuda_device)#
Explicitly sets the CUDA device before running the engine.
This is to ensure that the CUDA device is correctly propagated when running in a new thread context.
- async run_engine_async(
- loop: Optional[asyncio.AbstractEventLoop] = None,
- )
Runs the engine asynchronously using asyncio
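An asyncio sketch, assuming `engine` is a `StaticInferenceEngine` with requests already queued; the event loop argument is optional per the signature above:

```python
import asyncio

async def drain_queue():
    # Runs the engine until no requests remain, without blocking the event loop.
    await engine.run_engine_async()

asyncio.run(drain_queue())
```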