core.inference.scheduler#

Module Contents#

Classes#

Scheduler

Scheduler for handling requests to the inference engine

API#

class core.inference.scheduler.Scheduler(max_batch_size)#

Scheduler for handling requests to the inference engine

This class is responsible for handling all incoming requests.

Parameters:
  • max_batch_size (int) – The max batch size that we can pass to the inference engine at a time.

  • request_type (InferenceRequest) – The class to use for instantiating new requests.

Initialization
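
A minimal construction sketch, assuming the import path shown in the signature above; the batch size value is an arbitrary illustrative choice.

from megatron.core.inference.scheduler import Scheduler

# Cap how many requests are passed to the inference engine at a time
# (the value 8 is illustrative).
scheduler = Scheduler(max_batch_size=8)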

get_new_request_id() → int#

Gets a new request id

add_request(
prompt: Optional[str] = None,
prompt_tokens: Optional[torch.Tensor] = None,
encoder_prompt: Optional[str] = None,
sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
arrival_time: Optional[float] = None,
streaming: bool = False,
inference_request: Optional[megatron.core.inference.inference_request.InferenceRequest] = None,
*,
inference_parameters: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
) → int#

Add an incoming request

This method adds the request to the active pool if it holds fewer than max batch size requests, and to the waiting pool otherwise.

Parameters:
  • prompt (str) – Input prompt string

  • prompt_tokens (torch.Tensor) – A torch tensor having the input prompts tokenized

  • encoder_prompt (str) – Encoder input string

  • sampling_params (SamplingParams) – The sampling parameters

  • arrival_time (float, optional) – The incoming request time. Defaults to None.

  • streaming (bool, optional) – Whether to asynchronously stream tokens for this request.

  • inference_request (InferenceRequest, optional) – A fully constructed request. Defaults to None.

Returns:

The request_id for the new request.
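
A sketch of enqueuing a single text prompt on the scheduler constructed above; it assumes SamplingParams can be constructed with its defaults, which this page does not confirm.

from megatron.core.inference.sampling_params import SamplingParams

# Enqueue a text prompt; the scheduler places it in the active pool if
# there is room, otherwise in the waiting pool, and returns its id.
request_id = scheduler.add_request(
    prompt="Write a haiku about schedulers.",
    sampling_params=SamplingParams(),  # assumed default-constructible
    streaming=False,
)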

num_requests_pending() → int#

Get the number of requests pending.

This method returns the number of active + waiting requests.

have_requests_pending() → bool#

Method to check if there are requests pending.

This method returns False only when there are no active or waiting requests.
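
A drain-loop sketch combining the two pending-status helpers; run_engine_step is a hypothetical stand-in for whatever drives the engine and returns its per-step result dict.

# Hypothetical driver loop: keep stepping the engine while any
# request is still active or waiting.
while scheduler.have_requests_pending():
    print(f"{scheduler.num_requests_pending()} request(s) pending")
    results = run_engine_step()  # hypothetical engine call
    scheduler.update_requests_pools(result_dict=results)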

add_earliest_waiting_request_to_active_pool()#

Utility to add the earliest waiting request to the active pool.

This method will add the earliest request (FIFO) that is in the waiting request pool to the active request pool.
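
A direct call, shown for illustration only; promotion is typically handled via update_requests_pools (described below).

# Promote the oldest (FIFO) waiting request into the active pool.
scheduler.add_earliest_waiting_request_to_active_pool()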

update_requests_pools(
result_dict: Optional[OrderedDict[int, megatron.core.inference.inference_request.InferenceRequest]] = None,
)#

Update request pool status

This method fills up the active request pool from the waiting request pool when the active pool has fewer than max batch size elements. If provided with a result dict, it also moves the completed requests into the completed request pool and promotes waiting requests into the active pool.

Parameters:

result_dict (OrderedDict[int, InferenceRequest], optional) – The result returned by the engine. A dictionary with keys as the request ids, and values as the requests. Defaults to None.
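
Both call patterns, sketched; results stands for a hypothetical OrderedDict mapping request ids to finished requests, as returned by the engine.

# With engine results: retire completed requests into the completed
# pool and promote waiting requests into the active pool.
scheduler.update_requests_pools(result_dict=results)

# Without results: only tops up the active pool from the waiting pool.
scheduler.update_requests_pools()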

abort_request(
request_id: int,
*,
exception: Optional[Union[BaseException, Type[BaseException]]] = None,
)#

Cancels the given request.
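
A cancellation sketch; attaching TimeoutError as the reason is an illustrative choice, not a documented convention.

# Cancel a previously added request, optionally recording the reason
# as an exception type.
scheduler.abort_request(request_id, exception=TimeoutError)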