`core.inference.inference_request`#

Module Contents#

Classes#

`Status`	Enum for status
`InferenceRequest`	Class for one inference request
`DynamicInferenceEventType`	Dynamic inference event type.
`DynamicInferenceEvent`	A lifecycle event for a dynamic inference requests.
`DynamicInferenceRequest`	Class for one inference request
`DynamicInferenceRequestRecord`	History of DynamicInferenceRequest objects over multiple request checkpoints.
`VLMInferenceRequest`	Class for a VLM inference request

Functions#

`serialize_tensor`	Serialize tensor to bytes.
`deserialize_tensor`	Deserialize tensor from bytes.
`serialize_ndarray`	Serialize numpy array to a JSON-compatible dict.
`deserialize_ndarray`	Deserialize numpy array from dict.
`unwrap_serialized_tensors`	Unwrap (“tensor”, […]) tuples produced by serialize() into plain lists.
`compute_block_hashes_batched`	Compute SHA-256 based hashes for all complete blocks in a prompt.

API#

core.inference.inference_request.serialize_tensor(tensor: torch.Tensor) → List#

Serialize tensor to bytes.

Parameters:: tensor (Tensor) – Tensor.
Returns:: (List) Tensor as a list

core.inference.inference_request.deserialize_tensor(tensor_as_list: List) → torch.Tensor#

Deserialize tensor from bytes.

Parameters:: tensor_as_list (List) – List representation of tensor.
Returns:: (Tensor) Tensor.

core.inference.inference_request.serialize_ndarray(arr: numpy.ndarray) → dict#: Serialize numpy array to a JSON-compatible dict.

core.inference.inference_request.deserialize_ndarray(obj: dict) → numpy.ndarray#: Deserialize numpy array from dict.

core.inference.inference_request.unwrap_serialized_tensors(serialized_request: dict) → dict#

Unwrap (“tensor”, […]) tuples produced by serialize() into plain lists.

Parameters:: serialized_request (dict) – A dict produced by serialize().
Returns:: A shallow copy with tensor wrapper tuples replaced by their inner lists.
Return type:: dict

class core.inference.inference_request.Status(*args, **kwds)#

Bases: enum.Enum

Enum for status

Initialization

WAITING_IN_QUEUE#: 1

ACTIVE_AND_GENERATING_TOKENS#: 2

ACTIVE_BUT_NOT_GENERATING_TOKENS#: 3

COMPLETED#: 4

FAILED#: 5

core.inference.inference_request.compute_block_hashes_batched( prompt_tokens: torch.Tensor, block_size: int, ) → List[int]#

Compute SHA-256 based hashes for all complete blocks in a prompt.

Each block hash is computed as SHA-256(parent_digest || block_bytes), where parent_digest chains from the previous block (starting from a zero digest). This provides cryptographic collision resistance with no exploitable algebraic structure.

Parameters:

prompt_tokens – All prompt token IDs, shape [seq_len].
block_size – Number of tokens per block.

Returns:

List of positive integer hash values in [1, 2^63-1], one per complete block.

class core.inference.inference_request.InferenceRequest#

Class for one inference request

Containing relevant data for an inference request

request_id: int#: None

prompt: str#: None

sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams]#: None

inference_parameters: Optional[megatron.core.inference.sampling_params.SamplingParams]#: None

prompt_tokens: Optional[List[int]]#: None

arrival_time: Optional[float]#: None

status: Optional[core.inference.inference_request.Status]#: None

encoder_prompt: Optional[str]#: None

generated_text: Optional[str]#: None

segments: Optional[List[str]]#: None

generated_segments: Optional[List[str]]#: None

generated_sequence_lengths: Optional[List[int]]#: None

generated_tokens: Optional[torch.Tensor]#: None

prompt_log_probs: Optional[torch.Tensor]#: None

generated_log_probs: Optional[torch.Tensor]#: None

prompt_top_n_logprobs: Optional[List[Dict[str, float]]]#: None

generated_top_n_logprobs: Optional[List[Dict[str, float]]]#: None

generated_length: Optional[int]#: None

tpot: Optional[List[int]]#: None

__post_init__()#

serialize() → dict#

Converts the instance into a serializable dictionary.

Returns:: (dict) A dictionary representation of the instance suitable for serialization.

classmethod deserialize( obj: dict, ) → core.inference.inference_request.InferenceRequest#

Deserialize request.

Parameters:: obj (dict) – Serialized request data.
Returns:: (InferenceRequest) Deserialized request.

_post_deserialize(obj: dict)#: This is called after the dataclass is initialized to handle any special deserialization logic.

class core.inference.inference_request.DynamicInferenceEventType(*args, **kwds)#

Bases: enum.Enum

Dynamic inference event type.

Initialization

ADD_ENGINE#: ‘auto(…)’

ADD_CONTEXT#: ‘auto(…)’

GENERATED_TOKEN#: ‘auto(…)’

PAUSE#: ‘auto(…)’

EVICT#: ‘auto(…)’

FINISH#: ‘auto(…)’

FAIL#: ‘auto(…)’

ERROR_TRANSIENT#: ‘auto(…)’

ERROR_NONTRANSIENT#: ‘auto(…)’

class core.inference.inference_request.DynamicInferenceEvent#

A lifecycle event for a dynamic inference requests.

An event is currently one of the following:

request added
request paused
request evicted
request finished
request failed
request error (transient)
request error (non-transient, i.e. fatal)

timestamp: Optional[float]#: None

type: core.inference.inference_request.DynamicInferenceEventType#: None

payload: Optional[Any]#: None

__post_init__()#

__str__()#

serialize() → dict#

Converts the instance into a serializable dictionary.

Returns:: Full event dict.
Return type:: dict

classmethod deserialize( obj: dict, ) → core.inference.inference_request.DynamicInferenceEvent#

Deserialize event.

Parameters:: obj – Serialized event data dict.
Returns:: (DynamicInferenceEvent) Deserialized event.

class core.inference.inference_request.DynamicInferenceRequest#

Bases: core.inference.inference_request.InferenceRequest

Class for one inference request

Containing relevant data for an dynamic inference request

request_id: int#: None

prompt: Optional[str]#: None

prompt_tokens: Optional[torch.Tensor]#: None

remaining_prompt_tokens: Optional[torch.Tensor]#: None

policy_epoch: Optional[list[tuple[int, int]]]#: None

kv_cache_epoch: Optional[list[tuple[int, int]]]#: None

latency: Optional[float]#: None

routing_indices: Optional[numpy.ndarray]#: None

finished_chunk_token_count: int#: 0

stop_word_ids: Optional[List[List[int]]]#: None

block_size_tokens: Optional[int]#: None

enable_prefix_caching: bool#: False

precomputed_block_hashes: List[int]#: ‘field(…)’

__post_init__()#

_compute_block_hashes() → None#

Compute hashes for all complete blocks in the prompt.

After this call:

precomputed_block_hashes is [] if prompt < block_size (no complete blocks)
precomputed_block_hashes is [hash1, …] for N complete blocks

property remaining_prompt_length#: Get the length of the remaining prompt tokens.

ttft: Optional[float]#: None

events: List[core.inference.inference_request.DynamicInferenceEvent]#: ‘field(…)’

event_add_engine: Optional[core.inference.inference_request.DynamicInferenceEvent]#: ‘field(…)’

generated_tokens: List[int]#: ‘field(…)’

__str__()#

serialize()#

Converts the instance into a serializable dictionary.

Returns:: (dict) A dictionary representation of the instance suitable for serialization.

_post_deserialize(obj)#

property tracked_metadata: List[Any]#

Obtain an ordered list of all request metadata to be tracked by the context.

This consists of metadata that is used to inform text generation. The values of such fields are tensorized and kept aligned with the current active batch.

Note that while the general request object is mutable, this metadata is inherently assumed to remain immutable once the request becomes active.

static get_metadata_types() → List[Tuple[str, torch.dtype, bool]]#

Keeps track of all request metadata names, dtypes, and target device.

Returns:: Mapping from metadata name to: name (str) - The name of the metadata field. dtype (torch.dtype) - The datatype of the metadata. on_device (bool) - Whether the metadata lives on GPU (True) or CPU (False).
Return type:: List[Tuple[str, torch.dtype, bool]]

add_event( type: core.inference.inference_request.DynamicInferenceEventType, payload: Optional[Any] = None, ) → core.inference.inference_request.DynamicInferenceEvent#: Add event.

add_event_add_engine()#: Add ‘add_engine’ event - called when request enters the engine queue.

add_event_add_context()#: Add ‘add_context’ event - called when request is added to context for prefill.

add_event_generated_token( token: int, blocks_total: Optional[int] = None, blocks_hashed_total: Optional[int] = None, blocks_hashed_active: Optional[int] = None, blocks_ref_count: Optional[int] = None, pre_fwd_active_token_count: Optional[int] = None, pre_fwd_step_count: Optional[int] = None, )#

Add ‘generated_token’ event - records each generated token.

Parameters:

token (int) – The token ID that was generated.
blocks_total (int) – Total block capacity from allocator.
blocks_hashed_total (int) – All allocated (hashed) blocks.
blocks_hashed_active (int) – Blocks with ref_count > 0.
blocks_ref_count (int) – Sum of block ref counts from allocator.
pre_fwd_active_token_count (int) – Active token count before forward pass.
pre_fwd_step_count (int) – Step count before forward pass.

add_event_pause()#: Add ‘pause’ event.

add_event_evict()#: Add ‘evict’ event.

add_event_finish()#: Add ‘finish’ event.

add_event_fail()#: Add ‘fail’ event.

add_event_error_transient(error: Exception)#: Add transient error event.

add_event_error_nontransient(error: Exception)#: Add non-transient error event.

succeeded() → bool#: Request experienced no non-transient errors.

failed() → bool#: Request experienced non-transient error.

class core.inference.inference_request.DynamicInferenceRequestRecord#

History of DynamicInferenceRequest objects over multiple request checkpoints.

requests: list[core.inference.inference_request.DynamicInferenceRequest]#: ‘field(…)’

latency: Optional[float]#: None

classmethod from_request( request: core.inference.inference_request.DynamicInferenceRequest, ) → core.inference.inference_request.DynamicInferenceRequestRecord#

Initialize record from a single request.

Parameters:: request (DynamicInferenceRequest) – Initial request.
Returns:: (DynamicInferenceRequestRecord) A record.

__getitem__( idx: int, ) → core.inference.inference_request.DynamicInferenceRequest#

Get request by index.

Parameters:: idx (int) – Request index.
Returns:: (DynamicInferenceRequest) Request object.

property request_id: int#

Get request id.

Returns:: (int) Request id.

checkpoint( tokenizer: megatron.core.tokenizers.MegatronTokenizer | None = None, )#

Maintain reference to previous request, and then append a new request that concatenates the previous prompt and generations.

Parameters:: tokenizer (MegatronTokenizer | None) – (Deprecated) Tokenizer.

merge( tokenizer: megatron.core.tokenizers.MegatronTokenizer | None = None, ) → core.inference.inference_request.DynamicInferenceRequest#

Merge requests into a single checkpoint-agnostic request object.

Parameters:: tokenizer (MegatronTokenizer | None) – (Deprecated) Tokenizer.
Returns:: (DynamicInferenceRequest) Merged request.

serialize() → dict#

Converts the instance into a serializable dictionary.

Returns:: (dict) A dictionary representation of the instance suitable for serialization.

classmethod deserialize( obj: dict, ) → core.inference.inference_request.DynamicInferenceRequestRecord#

Deserialize record.

Parameters:: obj (dict) – Serialized record data.
Returns:: (DynamicInferenceRequestRecord) Deserialized record.

class core.inference.inference_request.VLMInferenceRequest#

Bases: core.inference.inference_request.InferenceRequest

Class for a VLM inference request

num_img_embeddings_per_tile: int#: None

imgs: torch.Tensor#: None

num_tiles: torch.Tensor#: None

decoder_seq_length: int#: None

core.inference.inference_request#

Module Contents#

Classes#

Functions#

API#

`core.inference.inference_request`#