core.inference.scheduler#
Module Contents#
Classes#
| Scheduler | Scheduler for handling requests to the inference engine |
API#
- class core.inference.scheduler.Scheduler(max_batch_size)#
Scheduler for handling requests to the inference engine
This class is responsible for handling all of the incoming requests
- Parameters:
max_batch_size (int) – The max batch size that we can pass to the inference engine at a time.
request_type (InferenceRequest) – The class to use for instantiating new requests.
Initialization
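A minimal construction sketch, assuming the class is importable from `megatron.core.inference.scheduler` (the fully qualified path used elsewhere on this page); the batch size value is illustrative:

```python
from megatron.core.inference.scheduler import Scheduler

# Keep at most 8 requests in the active pool at a time; additional incoming
# requests are parked in the waiting pool until a slot frees up.
scheduler = Scheduler(max_batch_size=8)
```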
- get_new_request_id() int#
Gets a new request id
- add_request(
- prompt: Optional[str] = None,
- prompt_tokens: Optional[torch.Tensor] = None,
- encoder_prompt: Optional[str] = None,
- sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
- arrival_time: Optional[float] = None,
- streaming: bool = False,
- inference_request: Optional[megatron.core.inference.inference_request.InferenceRequest] = None,
- *,
- inference_parameters: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
- ) int#
Add an incoming request
This method will add the request to the active pool if it holds fewer than max_batch_size requests; otherwise the request is placed in the waiting pool.
- Parameters:
prompt (str) – Input prompt string
prompt_tokens (torch.Tensor) – A torch tensor having the input prompts tokenized
encoder_prompt (str) – Encoder input string
sampling_params (SamplingParams) – The sampling parameters
arrival_time (float, optional) – The incoming request time. Defaults to None.
streaming (bool, optional) – Whether to asynchronously stream tokens for this request.
inference_request (InferenceRequest, optional) – A fully constructed request. Defaults to None.
- Returns:
The request_id for the new request.
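A hedged usage sketch of `add_request`, reusing the `scheduler` instance from the construction example above; the prompt text and default `SamplingParams()` are purely illustrative:

```python
from megatron.core.inference.sampling_params import SamplingParams

# Submit a plain-text prompt with default sampling settings. The scheduler
# assigns a fresh request id and returns it; the request lands in the active
# pool if there is room, otherwise in the waiting pool.
request_id = scheduler.add_request(
    prompt="What is the capital of France?",
    sampling_params=SamplingParams(),
    streaming=False,
)
```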
- num_requests_pending() int#
Get the number of requests pending.
This method returns the number of active + waiting requests.
- have_requests_pending() bool#
Method to check if there are requests pending.
This method returns False only when there are no active requests and no waiting requests.
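For example, the two status helpers can be combined to report scheduler load (a sketch reusing the `scheduler` instance from above):

```python
# Both helpers look at the active and waiting pools together.
if scheduler.have_requests_pending():
    print(f"{scheduler.num_requests_pending()} request(s) active or waiting")
```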
- add_earliest_waiting_request_to_active_pool()#
Utility to add the earliest waiting request to the active pool
This method will add the earliest request (FIFO) that is in the waiting request pool to the active request pool.
- update_requests_pools(
- result_dict: Optional[OrderedDict[int, megatron.core.inference.inference_request.InferenceRequest]] = None,
- )#
Update request pool status
This method will fill up the active request pool from the waiting request pool if the active pool has fewer than max batch size elements. If provided with a result dict, it will move the completed requests into the completed request pool and add waiting requests into the active pool.
- Parameters:
result_dict (OrderedDict[int, InferenceRequest], optional) – The result returned by the engine: a dictionary with request ids as keys and the requests as values. Defaults to None.
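A sketch of how a driving loop might use this method; `run_engine_step` is a hypothetical stand-in for one inference-engine step and is assumed to return an `OrderedDict` mapping request ids to the requests that completed during that step:

```python
# Keep stepping the engine while any request is active or waiting.
while scheduler.have_requests_pending():
    result_dict = run_engine_step()  # hypothetical engine call, not part of this API
    # Move completed requests to the completed pool and top up the active pool
    # from the waiting pool, up to max_batch_size entries.
    scheduler.update_requests_pools(result_dict=result_dict)
```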
- abort_request(
- request_id: int,
- *,
- exception: Optional[Union[BaseException, Type[BaseException]]] = None,
- )#
Cancels the given request
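A sketch of cancelling a previously submitted request, reusing the `request_id` returned by `add_request` above; the exception type passed here is illustrative:

```python
# Cancel the request; passing an exception type records why it was aborted.
scheduler.abort_request(request_id, exception=TimeoutError)
```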