Experimental gRPC Python API
Client Core
This module contains most of the core functionality of the library, including setting up a connection, sending requests to, and receiving responses from an active Triton server.
- class tritongrpcclient.InferInput(name, shape, datatype)
An object of InferInput class is used to describe an input tensor for an inference request.
- Parameters
name (str) – The name of the input whose data will be described by this object.
shape (list) – The shape of the associated input.
datatype (str) – The datatype of the associated input.
- datatype()
Get the datatype of input associated with this object.
- Returns
The datatype of input
- Return type
str
- name()
Get the name of input associated with this object.
- Returns
The name of input
- Return type
str
- set_data_from_numpy(input_tensor)
Set the tensor data from the specified numpy array for input associated with this object.
- Parameters
input_tensor (numpy array) – The tensor data in numpy array format
- Raises
InferenceServerException – If failed to set data for the tensor.
- set_shape(shape)
Set the shape of input.
- Parameters
shape (list) – The shape of the associated input.
- set_shared_memory(region_name, byte_size, offset=0)
Set the tensor data from the specified shared memory region.
- Parameters
region_name (str) – The name of the shared memory region holding tensor data.
byte_size (int) – The size of the shared memory region holding tensor data.
offset (int) – The offset, in bytes, into the region where the data for the tensor starts. The default value is 0.
- shape()
Get the shape of input associated with this object.
- Returns
The shape of input
- Return type
list
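As a quick illustration, here is a minimal sketch of preparing an input; the tensor name, shape, and datatype ("INPUT0", [1, 16], "FP32") are hypothetical and must match what your model's metadata reports:

    import numpy as np
    import tritongrpcclient

    # Hypothetical input tensor; adjust name/shape/datatype to your model.
    input0 = tritongrpcclient.InferInput("INPUT0", [1, 16], "FP32")
    input0.set_data_from_numpy(np.ones([1, 16], dtype=np.float32))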
- class tritongrpcclient.InferRequestedOutput(name, class_count=0)
An object of InferRequestedOutput class is used to describe a requested output tensor for an inference request.
- Parameters
name (str) – The name of output tensor to associate with this object
class_count (int) – The number of classifications to be requested. The default value is 0 which means the classification results are not requested.
- name()
Get the name of output associated with this object.
- Returns
The name of output
- Return type
str
- set_shared_memory(region_name, byte_size, offset=0)
Marks the output to return the inference result in the specified shared memory region.
- Parameters
region_name (str) – The name of the shared memory region to hold tensor data.
byte_size (int) – The size of the shared memory region to hold tensor data.
offset (int) – The offset, in bytes, into the region where the data for the tensor starts. The default value is 0.
- Raises
InferenceServerException – If failed to set shared memory for the tensor.
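A short sketch, assuming a hypothetical output tensor named "OUTPUT0"; passing a non-zero class_count asks the server for classification results instead of the raw tensor:

    import tritongrpcclient

    # Raw tensor data for a hypothetical output.
    output0 = tritongrpcclient.InferRequestedOutput("OUTPUT0")

    # Alternatively, request the top-3 classification results.
    top3 = tritongrpcclient.InferRequestedOutput("OUTPUT0", class_count=3)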
- class tritongrpcclient.InferResult(result)
An object of InferResult class holds the response of an inference request and provides methods to retrieve inference results.
- Parameters
result (protobuf message) – The ModelInferResponse returned by the server
- as_numpy(name)
Get the tensor data for output associated with this object in numpy format.
- Parameters
name (str) – The name of the output tensor whose result is to be retrieved.
- Returns
The numpy array containing the response data for the tensor or None if the data for specified tensor name is not found.
- Return type
numpy array
- get_output(name, as_json=False)
Retrieves the InferOutputTensor corresponding to the named output.
- Parameters
name (str) – The name of the tensor for which Output is to be retrieved.
as_json (bool) – If True then returns response as a json dict, otherwise as a protobuf message. Default value is False.
- Returns
If an InferOutputTensor with the specified name is present in ModelInferResponse then returns it as a protobuf message or dict, otherwise returns None.
- Return type
protobuf message or dict
- get_response(as_json=False)
Retrieves the complete ModelInferResponse as a json dict object or protobuf message.
- Parameters
as_json (bool) – If True then returns response as a json dict, otherwise as a protobuf message. Default value is False.
- Returns
The underlying ModelInferResponse as a protobuf message or dict.
- Return type
protobuf message or dict
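Assuming results is the InferResult returned by InferenceServerClient.infer() (documented below) and "OUTPUT0" is a hypothetical tensor name, typical consumption looks like:

    # Tensor data as a numpy array, or None if the name is not in the response.
    output_data = results.as_numpy("OUTPUT0")

    # The InferOutputTensor for that name, as a json dict.
    output_meta = results.get_output("OUTPUT0", as_json=True)

    # The complete ModelInferResponse as a protobuf message.
    response = results.get_response()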
- class tritongrpcclient.InferenceServerClient(url, verbose=False)
An InferenceServerClient object is used to perform any kind of communication with the InferenceServer using the gRPC protocol.
- Parameters
url (str) – The inference server URL, e.g. ‘localhost:8001’.
verbose (bool) – If True generate verbose output. Default value is False.
- Raises
Exception – If unable to create a client.
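Creating a client is a one-liner; this sketch assumes a Triton server listening on its default gRPC port (8001) on the local machine:

    import tritongrpcclient

    # Adjust the URL for your deployment; 8001 is Triton's default gRPC port.
    triton_client = tritongrpcclient.InferenceServerClient(url="localhost:8001")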
- async_infer(model_name, inputs, callback, model_version='', outputs=None, request_id='', sequence_id=0, sequence_start=False, sequence_end=False, priority=0, timeout=None, headers=None)
Run asynchronous inference using the supplied ‘inputs’ requesting the outputs specified by ‘outputs’.
- Parameters
model_name (str) – The name of the model to run inference.
inputs (list) – A list of InferInput objects, each describing data for an input tensor required by the model.
callback (function) – Python function that is invoked once the request is completed. The function must reserve the last two arguments (result, error) to hold InferResult and InferenceServerException objects respectively which will be provided to the function when executing the callback. The ownership of these objects will be given to the user. The ‘error’ would be None for a successful inference.
model_version (str) – The version of the model to run inference. The default value is an empty string which means the server will choose a version based on the model and internal policy.
outputs (list) – A list of InferRequestedOutput objects, each describing how the output data must be returned. If not specified all outputs produced by the model will be returned using default settings.
request_id (str) – Optional identifier for the request. If specified will be returned in the response. Default value is an empty string which means no request_id will be used.
sequence_id (int) – The unique identifier for the sequence being represented by the object. Default value is 0 which means that the request does not belong to a sequence.
sequence_start (bool) – Indicates whether the request being added marks the start of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.
sequence_end (bool) – Indicates whether the request being added marks the end of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.
priority (int) – Indicates the priority of the request. Priority value zero indicates that the default priority level should be used (i.e. same behavior as not specifying the priority parameter). Lower value priorities indicate higher priority levels. Thus the highest priority level is indicated by setting the parameter to 1, the next highest is 2, etc. If not provided, the server will handle the request using default setting for the model.
timeout (int) – The timeout value for the request, in microseconds. If the request cannot be completed within the time the server can take a model-specific action such as terminating the request. If not provided, the server will handle the request using default setting for the model.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
- Raises
InferenceServerException – If server fails to issue inference.
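A hedged sketch of the callback pattern: the callback's last two arguments must be (result, error), and a queue is one simple way to hand the response back to the calling thread. The model and tensor names are hypothetical:

    import queue
    from functools import partial

    import numpy as np
    import tritongrpcclient

    def callback(response_queue, result, error):
        # Exactly one of result/error is None; enqueue whichever arrived.
        response_queue.put(result if error is None else error)

    triton_client = tritongrpcclient.InferenceServerClient(url="localhost:8001")
    input0 = tritongrpcclient.InferInput("INPUT0", [1, 16], "FP32")
    input0.set_data_from_numpy(np.ones([1, 16], dtype=np.float32))

    responses = queue.Queue()
    triton_client.async_infer(model_name="simple", inputs=[input0],
                              callback=partial(callback, responses))
    result_or_error = responses.get()  # blocks until the callback fires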
- async_stream_infer(model_name, inputs, model_version='', outputs=None, request_id='', sequence_id=0, sequence_start=False, sequence_end=False, priority=0, timeout=None)
Runs an asynchronous inference over the gRPC bi-directional streaming API. A stream must be established with a call to start_stream() before calling this function. All the results will be provided to the callback function associated with the stream.
- Parameters
model_name (str) – The name of the model to run inference.
inputs (list) – A list of InferInput objects, each describing data for an input tensor required by the model.
model_version (str) – The version of the model to run inference. The default value is an empty string which means the server will choose a version based on the model and internal policy.
outputs (list) – A list of InferRequestedOutput objects, each describing how the output data must be returned. If not specified all outputs produced by the model will be returned using default settings.
request_id (str) – Optional identifier for the request. If specified will be returned in the response. Default value is an empty string which means no request_id will be used.
sequence_id (int) – The unique identifier for the sequence being represented by the object. Default value is 0 which means that the request does not belong to a sequence.
sequence_start (bool) – Indicates whether the request being added marks the start of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.
sequence_end (bool) – Indicates whether the request being added marks the end of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.
priority (int) – Indicates the priority of the request. Priority value zero indicates that the default priority level should be used (i.e. same behavior as not specifying the priority parameter). Lower value priorities indicate higher priority levels. Thus the highest priority level is indicated by setting the parameter to 1, the next highest is 2, etc. If not provided, the server will handle the request using default setting for the model.
timeout (int) – The timeout value for the request, in microseconds. If the request cannot be completed within the time the server can take a model-specific action such as terminating the request. If not provided, the server will handle the request using default setting for the model.
- Raises
InferenceServerException – If server fails to issue inference.
- close()
Close the client. Any future calls to the server will result in an error.
- get_cuda_shared_memory_status(region_name='', headers=None, as_json=False)
Request cuda shared memory status from the server.
- Parameters
region_name (str) – The name of the region to query status. The default value is an empty string, which means that the status of all active cuda shared memory will be returned.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
as_json (bool) – If True then returns cuda shared memory status as a json dict, otherwise as a protobuf message. Default value is False.
- Returns
The JSON dict or CudaSharedMemoryStatusResponse message holding the cuda shared memory status.
- Return type
dict or protobuf message
- Raises
InferenceServerException – If unable to get the status of specified shared memory.
- get_inference_statistics(model_name='', model_version='', headers=None, as_json=False)
Get the inference statistics for the specified model name and version.
- Parameters
model_name (str) – The name of the model to get statistics. The default value is an empty string, which means statistics of all models will be returned.
model_version (str) – The version of the model to get inference statistics. The default value is an empty string which means the server will return the statistics of all available model versions.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
as_json (bool) – If True then returns inference statistics as a json dict, otherwise as a protobuf message. Default value is False.
- Raises
InferenceServerException – If unable to get the model inference statistics.
- get_model_config(model_name, model_version='', headers=None, as_json=False)
Contact the inference server and get the configuration for the specified model.
- Parameters
model_name (str) – The name of the model
model_version (str) – The version of the model to get configuration. The default value is an empty string which means the server will choose a version based on the model and internal policy.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
as_json (bool) – If True then returns configuration as a json dict, otherwise as a protobuf message. Default value is False.
- Returns
The JSON dict or ModelConfigResponse message holding the model configuration.
- Return type
dict or protobuf message
- Raises
InferenceServerException – If unable to get model configuration.
- get_model_metadata(model_name, model_version='', headers=None, as_json=False)
Contact the inference server and get the metadata for the specified model.
- Parameters
model_name (str) – The name of the model
model_version (str) – The version of the model to get metadata. The default value is an empty string which means the server will choose a version based on the model and internal policy.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
as_json (bool) – If True then returns model metadata as a json dict, otherwise as a protobuf message. Default value is False.
- Returns
The JSON dict or ModelMetadataResponse message holding the metadata.
- Return type
dict or protobuf message
- Raises
InferenceServerException – If unable to get model metadata.
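For example, both calls can return plain dicts instead of protobuf messages; the model name "simple" is hypothetical:

    import tritongrpcclient

    triton_client = tritongrpcclient.InferenceServerClient(url="localhost:8001")
    metadata = triton_client.get_model_metadata("simple", as_json=True)
    config = triton_client.get_model_config("simple", as_json=True)
    print(metadata)  # dict form of ModelMetadataResponse
    print(config)    # dict form of ModelConfigResponse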
- get_model_repository_index(headers=None, as_json=False)
Get the index of model repository contents.
- Parameters
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
as_json (bool) – If True then returns model repository index as a json dict, otherwise as a protobuf message. Default value is False.
- Returns
The JSON dict or RepositoryIndexResponse message holding the model repository index.
- Return type
dict or protobuf message
- get_server_metadata(headers=None, as_json=False)
Contact the inference server and get its metadata.
- Parameters
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
as_json (bool) – If True then returns server metadata as a json dict, otherwise as a protobuf message. Default value is False.
- Returns
The JSON dict or ServerMetadataResponse message holding the metadata.
- Return type
dict or protobuf message
- Raises
InferenceServerException – If unable to get server metadata.
- get_system_shared_memory_status(region_name='', headers=None, as_json=False)
Request system shared memory status from the server.
- Parameters
region_name (str) – The name of the region to query status. The default value is an empty string, which means that the status of all active system shared memory will be returned.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
as_json (bool) – If True then returns system shared memory status as a json dict, otherwise as a protobuf message. Default value is False.
- Returns
The JSON dict or SystemSharedMemoryStatusResponse message holding the system shared memory status.
- Return type
dict or protobuf message
- Raises
InferenceServerException – If unable to get the status of specified shared memory.
- infer(model_name, inputs, model_version='', outputs=None, request_id='', sequence_id=0, sequence_start=False, sequence_end=False, priority=0, timeout=None, headers=None)
Run synchronous inference using the supplied ‘inputs’ requesting the outputs specified by ‘outputs’.
- Parameters
model_name (str) – The name of the model to run inference.
inputs (list) – A list of InferInput objects, each describing data for an input tensor required by the model.
model_version (str) – The version of the model to run inference. The default value is an empty string which means the server will choose a version based on the model and internal policy.
outputs (list) – A list of InferRequestedOutput objects, each describing how the output data must be returned. If not specified all outputs produced by the model will be returned using default settings.
request_id (str) – Optional identifier for the request. If specified will be returned in the response. Default value is an empty string which means no request_id will be used.
sequence_id (int) – The unique identifier for the sequence being represented by the object. Default value is 0 which means that the request does not belong to a sequence.
sequence_start (bool) – Indicates whether the request being added marks the start of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.
sequence_end (bool) – Indicates whether the request being added marks the end of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.
priority (int) – Indicates the priority of the request. Priority value zero indicates that the default priority level should be used (i.e. same behavior as not specifying the priority parameter). Lower value priorities indicate higher priority levels. Thus the highest priority level is indicated by setting the parameter to 1, the next highest is 2, etc. If not provided, the server will handle the request using default setting for the model.
timeout (int) – The timeout value for the request, in microseconds. If the request cannot be completed within the time the server can take a model-specific action such as terminating the request. If not provided, the server will handle the request using default setting for the model.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
- Returns
The object holding the result of the inference.
- Return type
InferResult
- Raises
InferenceServerException – If server fails to perform inference.
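Putting the pieces together, a minimal synchronous round trip might look like the following sketch; the model and tensor names are hypothetical, and shapes/datatypes must match your model:

    import numpy as np
    import tritongrpcclient

    triton_client = tritongrpcclient.InferenceServerClient(url="localhost:8001")

    input0 = tritongrpcclient.InferInput("INPUT0", [1, 16], "FP32")
    input0.set_data_from_numpy(np.ones([1, 16], dtype=np.float32))
    output0 = tritongrpcclient.InferRequestedOutput("OUTPUT0")

    results = triton_client.infer(model_name="simple",
                                  inputs=[input0],
                                  outputs=[output0],
                                  request_id="1")
    print(results.as_numpy("OUTPUT0"))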
- is_model_ready(model_name, model_version='', headers=None)
Contact the inference server and get the readiness of the specified model.
- Parameters
model_name (str) – The name of the model to check for readiness.
model_version (str) – The version of the model to check for readiness. The default value is an empty string which means the server will choose a version based on the model and internal policy.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
- Returns
True if the model is ready, False if not ready.
- Return type
bool
- Raises
InferenceServerException – If unable to get model readiness.
- is_server_live(headers=None)
Contact the inference server and get liveness.
- Parameters
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
- Returns
True if server is live, False if server is not live.
- Return type
bool
- Raises
InferenceServerException – If unable to get liveness.
- is_server_ready(headers=None)
Contact the inference server and get readiness.
- Parameters
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
- Returns
True if server is ready, False if server is not ready.
- Return type
bool
- Raises
InferenceServerException – If unable to get readiness.
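The three probes combine naturally into a startup gate; a sketch with a hypothetical model name:

    import tritongrpcclient

    triton_client = tritongrpcclient.InferenceServerClient(url="localhost:8001")
    if (triton_client.is_server_live()
            and triton_client.is_server_ready()
            and triton_client.is_model_ready("simple")):
        print("server and model are ready for inference")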
- load_model(model_name, headers=None)
Request the inference server to load or reload the specified model.
- Parameters
model_name (str) – The name of the model to be loaded.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
- Raises
InferenceServerException – If unable to load the model.
- register_cuda_shared_memory(name, raw_handle, device_id, byte_size, headers=None)
Request the server to register a cuda shared memory region with the following specification.
- Parameters
name (str) – The name of the region to register.
raw_handle (bytes) – The raw serialized cudaIPC handle in base64 encoding.
device_id (int) – The GPU device ID on which the cudaIPC handle was created.
byte_size (int) – The size of the cuda shared memory region, in bytes.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
- Raises
InferenceServerException – If unable to register the specified cuda shared memory.
- register_system_shared_memory(name, key, byte_size, offset=0, headers=None)
Request the server to register a system shared memory region with the following specification.
- Parameters
name (str) – The name of the region to register.
key (str) – The key of the underlying memory object that contains the system shared memory region.
byte_size (int) – The size of the system shared memory region, in bytes.
offset (int) – Offset, in bytes, within the underlying memory object to the start of the system shared memory region. The default value is zero.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
- Raises
InferenceServerException – If unable to register the specified system shared memory.
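A hedged sketch of the system shared memory lifecycle; it assumes a POSIX shared memory object with key "/my_shm" of 128 bytes has already been created and populated out of band, and the region name is hypothetical:

    import tritongrpcclient

    triton_client = tritongrpcclient.InferenceServerClient(url="localhost:8001")

    # Assumes "/my_shm" was created and filled elsewhere.
    triton_client.register_system_shared_memory("input_region", "/my_shm", 128)
    print(triton_client.get_system_shared_memory_status(as_json=True))
    triton_client.unregister_system_shared_memory("input_region")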
- start_stream(callback, headers=None)
Starts a grpc bi-directional stream to send streaming inferences. Note: when using a stream, the user must ensure that InferenceServerClient.close() gets called at exit.
- Parameters
callback (function) – Python function that is invoked upon receiving response from the underlying stream. The function must reserve the last two arguments (result, error) to hold InferResult and InferenceServerException objects respectively which will be provided to the function when executing the callback. The ownership of these objects will be given to the user. The ‘error’ would be None for a successful inference.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
- Raises
InferenceServerException – If unable to start a stream or a stream was already running for this client.
- stop_stream()
Stops the stream if one is available.
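A sketch of the streaming flow, reusing the queue-based callback from the async_infer example above; model and tensor names remain hypothetical:

    import queue
    from functools import partial

    import numpy as np
    import tritongrpcclient

    def callback(response_queue, result, error):
        response_queue.put(result if error is None else error)

    triton_client = tritongrpcclient.InferenceServerClient(url="localhost:8001")
    input0 = tritongrpcclient.InferInput("INPUT0", [1, 16], "FP32")
    input0.set_data_from_numpy(np.ones([1, 16], dtype=np.float32))

    responses = queue.Queue()
    triton_client.start_stream(callback=partial(callback, responses))
    try:
        # Every request on the stream reports back through the one callback.
        triton_client.async_stream_infer(model_name="simple", inputs=[input0])
        print(responses.get())
    finally:
        triton_client.stop_stream()
        triton_client.close()  # required cleanup when a stream was used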
- unload_model(model_name, headers=None)
Request the inference server to unload the specified model.
- Parameters
model_name (str) – The name of the model to be unloaded.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
- Raises
InferenceServerException – If unable to unload the model.
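load_model and unload_model pair naturally for explicit model control; this sketch (hypothetical model name) assumes the server was started with model control enabled:

    import tritongrpcclient

    triton_client = tritongrpcclient.InferenceServerClient(url="localhost:8001")
    triton_client.load_model("simple")      # load, or reload if already loaded
    assert triton_client.is_model_ready("simple")
    triton_client.unload_model("simple")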
- unregister_cuda_shared_memory(name='', headers=None)
Request the server to unregister a cuda shared memory region with the specified name.
- Parameters
name (str) – The name of the region to unregister. The default value is an empty string which means all the cuda shared memory regions will be unregistered.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
- Raises
InferenceServerException – If unable to unregister the specified cuda shared memory region.
- unregister_system_shared_memory(name='', headers=None)
Request the server to unregister a system shared memory region with the specified name.
- Parameters
name (str) – The name of the region to unregister. The default value is an empty string which means all the system shared memory regions will be unregistered.
headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.
- Raises
InferenceServerException – If unable to unregister the specified system shared memory region.
Client Utils
This module exposes additional supporting utilities.
- tritonclientutils.utils.raise_error(msg)
Raise error with the provided message.
- exception tritonclientutils.utils.InferenceServerException(msg, status=None, debug_details=None)
Exception indicating non-Success status.
- Parameters
msg (str) – A brief description of error
status (str) – The error code
debug_details (str) – The additional details on the error
- debug_details()
Get the detailed information about the exception for debugging purposes.
- Returns
Returns the exception details
- Return type
str
- message()
Get the exception message.
- Returns
The message associated with this exception, or None if no message.
- Return type
str
- status()
Get the status of the exception.
- Returns
Returns the status of the exception
- Return type
str
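All client calls surface failures as InferenceServerException, so a typical guard looks like this sketch (model name hypothetical):

    import tritongrpcclient
    from tritonclientutils.utils import InferenceServerException

    triton_client = tritongrpcclient.InferenceServerClient(url="localhost:8001")
    try:
        triton_client.get_model_metadata("no_such_model")
    except InferenceServerException as e:
        print(e.status(), e.message())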
- tritonclientutils.utils.serialize_byte_tensor(input_tensor)
Serializes a bytes tensor into a flat numpy array of length-prepended bytes. The bytes tensor can be passed as a numpy array of bytes with dtype np.bytes_, numpy strings with dtype np.str_, or python strings with dtype np.object.
- Parameters
input_tensor (np.array) – The bytes tensor to serialize.
- Returns
serialized_bytes_tensor – The 1-D numpy array of type uint8 containing the serialized bytes in ‘C’ order.
- Return type
np.array
- Raises
InferenceServerException – If unable to serialize the given tensor.
- tritonclientutils.utils.deserialize_bytes_tensor(encoded_tensor)
Deserializes an encoded bytes tensor into a numpy array with dtype of python objects.
- Parameters
encoded_tensor (bytes) – The encoded bytes tensor where each element has its length in the first 4 bytes followed by the content.
- Returns
string_tensor – The 1-D numpy array of type object containing the deserialized bytes in ‘C’ order.
- Return type
np.array
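A round-trip sketch through the two helpers; note that deserialize_bytes_tensor expects raw bytes, so the serialized numpy array is converted with tobytes():

    import numpy as np
    from tritonclientutils.utils import (serialize_byte_tensor,
                                         deserialize_bytes_tensor)

    strings = np.array(["hello", "triton"], dtype=np.object_)
    encoded = serialize_byte_tensor(strings)   # 1-D uint8 array, 'C' order
    decoded = deserialize_bytes_tensor(encoded.tobytes())
    print(decoded)  # 1-D object array of bytes: [b'hello', b'triton']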