core.inference.inference_request#
Module Contents#
Classes#
- `Status` – Enum for status
- `InferenceRequest` – Class for one inference request
- `DynamicInferenceEventType` – Dynamic inference event type.
- `DynamicInferenceEvent` – A lifecycle event for a dynamic inference request.
- `DynamicInferenceRequest` – Class for one inference request
- `DynamicInferenceRequestRecord` – History of DynamicInferenceRequest objects over multiple request checkpoints.
- `VLMInferenceRequest` – Class for a VLM inference request
Functions#
- `serialize_tensor` – Serialize tensor to bytes.
- `deserialize_tensor` – Deserialize tensor from bytes.
API#
- core.inference.inference_request.serialize_tensor(tensor: torch.Tensor) → List#
Serialize tensor to bytes.
- Parameters:
tensor (Tensor) – Tensor.
- Returns:
(List) Tensor as a list
- core.inference.inference_request.deserialize_tensor(tensor_as_list: List) → torch.Tensor#
Deserialize tensor from bytes.
- Parameters:
tensor_as_list (List) – List representation of tensor.
- Returns:
(Tensor) Tensor.
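As a usage sketch, the two helpers are intended to round-trip a tensor through a list form suitable for dict/JSON payloads. This assumes the module is importable as `megatron.core.inference.inference_request` (the fully qualified path used elsewhere on this page) and that `deserialize_tensor` inverts `serialize_tensor`, as the docstrings suggest:

```python
import torch

from megatron.core.inference.inference_request import (
    deserialize_tensor,
    serialize_tensor,
)

t = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# Tensor -> list form, suitable for embedding in a serialized request dict.
as_list = serialize_tensor(t)

# List form -> tensor (assumed round-trip per the docstrings above).
restored = deserialize_tensor(as_list)
assert torch.equal(t.cpu(), restored.cpu())
```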
- class core.inference.inference_request.Status(*args, **kwds)#
Bases: enum.Enum
Enum for status
Initialization
- WAITING_IN_QUEUE#
1
- ACTIVE_AND_GENERATING_TOKENS#
2
- ACTIVE_BUT_NOT_GENERATING_TOKENS#
3
- COMPLETED#
4
- FAILED#
5
- class core.inference.inference_request.InferenceRequest#
Class for one inference request
Contains relevant data for an inference request.
- request_id: int#
None
- prompt: str#
None
- sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams]#
None
- inference_parameters: Optional[megatron.core.inference.sampling_params.SamplingParams]#
None
- prompt_tokens: Optional[List[int]]#
None
- arrival_time: Optional[float]#
None
- status: Optional[core.inference.inference_request.Status]#
None
- encoder_prompt: Optional[str]#
None
- generated_text: Optional[str]#
None
- segments: Optional[List[str]]#
None
- generated_segments: Optional[List[str]]#
None
- generated_sequence_lengths: Optional[List[int]]#
None
- generated_tokens: Optional[torch.Tensor]#
None
- prompt_log_probs: Optional[torch.Tensor]#
None
- generated_log_probs: Optional[torch.Tensor]#
None
- prompt_top_n_logprobs: Optional[List[Dict[str, float]]]#
None
- generated_top_n_logprobs: Optional[List[Dict[str, float]]]#
None
- generated_length: Optional[int]#
None
- tpot: Optional[List[int]]#
None
- __post_init__()#
- serialize() → dict#
Converts the instance into a serializable dictionary.
- Returns:
(dict) A dictionary representation of the instance suitable for serialization.
- classmethod deserialize(obj: dict) → core.inference.inference_request.InferenceRequest#
Deserialize request.
- Parameters:
obj (dict) – Serialized request data.
- Returns:
(InferenceRequest) Deserialized request.
- _post_deserialize(obj: dict)#
This is called after the dataclass is initialized to handle any special deserialization logic.
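As a minimal sketch of the serialization API, a request can be round-tripped through its dict form via `serialize`/`deserialize`. The constructor arguments below are assumptions based on the attribute list above (only `request_id` and `prompt` are passed; the remaining fields keep their defaults):

```python
from megatron.core.inference.inference_request import InferenceRequest, Status

# Hypothetical construction; field names are taken from the attributes above.
req = InferenceRequest(request_id=0, prompt="Hello, world")
req.status = Status.WAITING_IN_QUEUE

payload = req.serialize()                       # plain dict representation
restored = InferenceRequest.deserialize(payload)
assert restored.request_id == req.request_id
```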
- class core.inference.inference_request.DynamicInferenceEventType(*args, **kwds)#
Bases: enum.Enum
Dynamic inference event type.
Initialization
- ADD#
auto(...)
- PAUSE#
auto(...)
- EVICT#
auto(...)
- FINISH#
auto(...)
- FAIL#
auto(...)
- ERROR_TRANSIENT#
auto(...)
- ERROR_NONTRANSIENT#
auto(...)
- class core.inference.inference_request.DynamicInferenceEvent#
A lifecycle event for a dynamic inference request.
An event is currently one of the following:
- request added
- request paused
- request evicted
- request finished
- request failed
- request error (transient)
- request error (non-transient, i.e. fatal)
- timestamp: Optional[float]#
None
- payload: Optional[Any]#
None
- __post_init__()#
- __str__()#
- serialize() → dict#
Converts the instance into a serializable dictionary.
- Returns:
(dict) A dictionary representation of the instance suitable for serialization.
- classmethod deserialize(obj: dict) → core.inference.inference_request.DynamicInferenceEvent#
Deserialize event.
- Parameters:
obj (dict) – Serialized event data.
- Returns:
(DynamicInferenceEvent) Deserialized event.
- class core.inference.inference_request.DynamicInferenceRequest#
Bases: core.inference.inference_request.InferenceRequest
Class for one inference request
Contains relevant data for a dynamic inference request.
- request_id: int#
None
- generated_tokens: List[int]#
field(...)
- prompt: Optional[str]#
None
- prompt_tokens: Optional[torch.Tensor]#
None
- remaining_prompt_tokens: Optional[torch.Tensor]#
None
- latency: Optional[float]#
None
- finished_chunk_token_count: int#
0
- stop_word_ids: Optional[List[List[int]]]#
None
- __post_init__()#
- property remaining_prompt_length#
Get the length of the remaining prompt tokens.
- events: List[core.inference.inference_request.DynamicInferenceEvent]#
field(...)
- __str__()#
- serialize() → dict#
Converts the instance into a serializable dictionary.
- Returns:
(dict) A dictionary representation of the instance suitable for serialization.
- _post_deserialize(obj)#
- property tracked_metadata: List[Any]#
Obtain an ordered list of all request metadata to be tracked by the context.
This consists of metadata that is used to inform text generation. The values of such fields are tensorized and kept aligned with the current active batch.
Note that while the general request object is mutable, this metadata is inherently assumed to remain immutable once the request becomes active.
- static get_metadata_types() → List[Tuple[str, torch.dtype, bool]]#
Keeps track of all request metadata names, dtypes, and target device.
- Returns:
A list of (name, dtype, on_device) tuples, where name (str) is the name of the metadata field, dtype (torch.dtype) is the datatype of the metadata, and on_device (bool) is whether the metadata lives on GPU (True) or CPU (False).
- Return type:
List[Tuple[str, torch.dtype, bool]]
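The documented return type makes the metadata straightforward to inspect; a small sketch (again assuming the `megatron.core.*` import path):

```python
from megatron.core.inference.inference_request import DynamicInferenceRequest

# List which per-request metadata fields the context tensorizes,
# and where each one is expected to live.
for name, dtype, on_device in DynamicInferenceRequest.get_metadata_types():
    print(f"{name}: {dtype} on {'cuda' if on_device else 'cpu'}")
```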
- add_event(type: core.inference.inference_request.DynamicInferenceEventType, payload: Optional[Any] = None)#
Add event.
- add_event_add()#
Add ‘add’ event.
- add_event_pause()#
Add ‘pause’ event.
- add_event_evict()#
Add ‘evict’ event.
- add_event_finish()#
Add ‘finish’ event.
- add_event_fail()#
Add ‘fail’ event.
- add_event_error_transient(error: Exception)#
Add transient error event.
- add_event_error_nontransient(error: Exception)#
Add non-transient error event.
- succeeded() → bool#
Request experienced no non-transient errors.
- failed() → bool#
Request experienced non-transient error.
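A sketch of the event helpers above; the constructor arguments are assumptions based on the fields listed earlier on this page:

```python
from megatron.core.inference.inference_request import DynamicInferenceRequest

req = DynamicInferenceRequest(request_id=7, prompt="Hello")

req.add_event_add()     # request enters the active batch
req.add_event_pause()   # request is preempted
req.add_event_add()     # request resumes
req.add_event_finish()  # generation completes

# No non-transient errors were recorded, so the request succeeded.
assert req.succeeded() and not req.failed()
```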
- class core.inference.inference_request.DynamicInferenceRequestRecord#
History of DynamicInferenceRequest objects over multiple request checkpoints.
- requests: list[core.inference.inference_request.DynamicInferenceRequest]#
field(...)
- latency: Optional[float]#
None
- classmethod from_request(request: core.inference.inference_request.DynamicInferenceRequest) → core.inference.inference_request.DynamicInferenceRequestRecord#
Initialize record from a single request.
- Parameters:
request (DynamicInferenceRequest) – Initial request.
- Returns:
(DynamicInferenceRequestRecord) A record.
- __getitem__(idx: int) → core.inference.inference_request.DynamicInferenceRequest#
Get request by index.
- Parameters:
idx (int) – Request index.
- Returns:
(DynamicInferenceRequest) Request object.
- property request_id: int#
Get request id.
- Returns:
(int) Request id.
- checkpoint(tokenizer: megatron.core.tokenizers.MegatronTokenizer | None = None)#
Maintain a reference to the previous request, then append a new request that concatenates the previous prompt and generations.
- Parameters:
tokenizer (MegatronTokenizer | None) – (Deprecated) Tokenizer.
- merge(tokenizer: megatron.core.tokenizers.MegatronTokenizer | None = None) → core.inference.inference_request.DynamicInferenceRequest#
Merge requests into a single checkpoint-agnostic request object.
- Parameters:
tokenizer (MegatronTokenizer | None) – (Deprecated) Tokenizer.
- Returns:
(DynamicInferenceRequest) Merged request.
- serialize() → dict#
Converts the instance into a serializable dictionary.
- Returns:
(dict) A dictionary representation of the instance suitable for serialization.
- classmethod deserialize(obj: dict) → core.inference.inference_request.DynamicInferenceRequestRecord#
Deserialize record.
- Parameters:
obj (dict) – Serialized record data.
- Returns:
(DynamicInferenceRequestRecord) Deserialized record.
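Putting the record API together, a minimal sketch (constructor arguments and the post-checkpoint index are assumptions, as above):

```python
from megatron.core.inference.inference_request import (
    DynamicInferenceRequest,
    DynamicInferenceRequestRecord,
)

req = DynamicInferenceRequest(request_id=3, prompt="Hello")
record = DynamicInferenceRequestRecord.from_request(req)

# Snapshot: keep the previous request and append a new one that
# concatenates the previous prompt and generations.
record.checkpoint()

latest = record[1]       # __getitem__ -> DynamicInferenceRequest
merged = record.merge()  # checkpoint-agnostic view of the whole history
assert merged.request_id == record.request_id
```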