core.inference.scheduler#

Module Contents#

Classes#

Scheduler

Scheduler for handling requests to the inference engine

API#

class core.inference.scheduler.Scheduler(max_batch_size)#

Scheduler for handling requests to the inference engine

This class is responsible for handling all incoming requests.

Parameters:
  • max_batch_size (int) – The max batch size that we can pass to the inference engine at a time.

  • request_type (InferenceRequest) – The class to use for instantiating new requests.

Initialization
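
A minimal construction sketch, assuming the import path shown in the signature above; the batch size value is an arbitrary illustrative choice.

from megatron.core.inference.scheduler import Scheduler

# Cap how many requests are passed to the inference engine at a time
# (the value 8 is illustrative).
scheduler = Scheduler(max_batch_size=8)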

get_new_request_id() → int#

Gets a new request id

add_request(
prompt: Optional[str] = None,
prompt_tokens: Optional[torch.Tensor] = None,
encoder_prompt: Optional[str] = None,
sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
arrival_time: Optional[float] = None,
streaming: bool = False,
inference_request: Optional[megatron.core.inference.inference_request.InferenceRequest] = None,
*,
inference_parameters: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
) → int#

Add an incoming request

This method adds the request to the active pool if it holds fewer than max batch size requests, and to the waiting pool otherwise.

Parameters:
  • prompt (str) – Input prompt string

  • prompt_tokens (torch.Tensor) – A torch tensor having the input prompts tokenized

  • encoder_prompt (str) – Encoder input string

  • sampling_params (SamplingParams) – The sampling parameters

  • arrival_time (float, optional) – The incoming request time. Defaults to None.

  • streaming (bool, optional) – Whether to asynchronously stream tokens for this request.

  • inference_request (InferenceRequest, optional) – A fully constructed request. Defaults to None.

Returns:

The request_id for the new request.
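
A sketch of enqueuing a single text prompt on the scheduler constructed above; it assumes SamplingParams can be constructed with its defaults, which this page does not confirm.

from megatron.core.inference.sampling_params import SamplingParams

# Enqueue a text prompt; the scheduler places it in the active pool if
# there is room, otherwise in the waiting pool, and returns its id.
request_id = scheduler.add_request(
    prompt="Write a haiku about schedulers.",
    sampling_params=SamplingParams(),  # assumed default-constructible
    streaming=False,
)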

num_requests_pending() → int#

Get the number of requests pending.

This method returns the number of active + waiting requests.

have_requests_pending() → bool#

Method to check if there are requests pending.

This method returns False only when there are no active or waiting requests.
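
A drain-loop sketch combining the two pending-status helpers; run_engine_step is a hypothetical stand-in for whatever drives the engine and returns its per-step result dict.

# Hypothetical driver loop: keep stepping the engine while any
# request is still active or waiting.
while scheduler.have_requests_pending():
    print(f"{scheduler.num_requests_pending()} request(s) pending")
    results = run_engine_step()  # hypothetical engine call
    scheduler.update_requests_pools(result_dict=results)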

add_earliest_waiting_request_to_active_pool()#

Utility to add the earliest waiting request to the active pool.

This method will add the earliest request (FIFO) that is in the waiting request pool to the active request pool.
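
A direct call, shown for illustration only; promotion is typically handled via update_requests_pools (described below).

# Promote the oldest (FIFO) waiting request into the active pool.
scheduler.add_earliest_waiting_request_to_active_pool()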

update_requests_pools(
result_dict: Optional[OrderedDict[int, megatron.core.inference.inference_request.InferenceRequest]] = None,
)#

Update request pool status

This method fills up the active request pool from the waiting request pool when the active pool has fewer than max batch size elements. If provided with a result dict, it also moves the completed requests into the completed request pool and promotes waiting requests into the active pool.

Parameters:

result_dict (OrderedDict[int, InferenceRequest], optional) – The result returned by the engine. A dictionary with keys as the request ids, and values as the requests. Defaults to None.
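
Both call patterns, sketched; results stands for a hypothetical OrderedDict mapping request ids to finished requests, as returned by the engine.

# With engine results: retire completed requests into the completed
# pool and promote waiting requests into the active pool.
scheduler.update_requests_pools(result_dict=results)

# Without results: only tops up the active pool from the waiting pool.
scheduler.update_requests_pools()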

abort_request(
request_id: int,
*,
exception: Optional[Union[BaseException, Type[BaseException]]] = None,
)#

Cancels the given request.
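
A cancellation sketch; attaching TimeoutError as the reason is an illustrative choice, not a documented convention.

# Cancel a previously added request, optionally recording the reason
# as an exception type.
scheduler.abort_request(request_id, exception=TimeoutError)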