tritonclient.grpc#

class tritonclient.grpc.InferInput(name, shape, datatype)#

An object of the InferInput class is used to describe an input tensor for an inference request.

Parameters:
  • name (str) – The name of input whose data will be described by this object

  • shape (list) – The shape of the associated input.

  • datatype (str) – The datatype of the associated input.

_get_content()#

Retrieve the contents for this tensor in raw bytes.

Returns:

The associated contents for this tensor in raw bytes.

Return type:

bytes

_get_tensor()#

Retrieve the underlying InferInputTensor message.

Returns:

The underlying InferInputTensor protobuf message.

Return type:

protobuf message

datatype()#

Get the datatype of input associated with this object.

Returns:

The datatype of input

Return type:

str

name()#

Get the name of input associated with this object.

Returns:

The name of input

Return type:

str

set_data_from_numpy(input_tensor)#

Set the tensor data from the specified numpy array for input associated with this object.

Parameters:

input_tensor (numpy array) – The tensor data in numpy array format

Returns:

The updated input

Return type:

InferInput

Raises:

InferenceServerException – If failed to set data for the tensor.

set_shape(shape)#

Set the shape of input.

Parameters:

shape (list) – The shape of the associated input.

Returns:

The updated input

Return type:

InferInput

set_shared_memory(region_name, byte_size, offset=0)#

Set the tensor data from the specified shared memory region.

Parameters:
  • region_name (str) – The name of the shared memory region holding tensor data.

  • byte_size (int) – The size of the shared memory region holding tensor data.

  • offset (int) – The offset, in bytes, into the region where the data for the tensor starts. The default value is 0.

Returns:

The updated input

Return type:

InferInput

shape()#

Get the shape of input associated with this object.

Returns:

The shape of input

Return type:

list
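
For example, a typical InferInput is constructed with the tensor name, shape and Triton datatype string, and then filled from a numpy array. The tensor name "INPUT0" below is only illustrative; use the names reported by the model metadata.

import numpy as np
import tritonclient.grpc as grpcclient

# Describe a 2x3 FP32 tensor named "INPUT0" and attach its data.
data = np.ones((2, 3), dtype=np.float32)
infer_input = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

print(infer_input.name(), infer_input.shape(), infer_input.datatype())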

class tritonclient.grpc.InferRequestedOutput(name, class_count=0)#

An object of the InferRequestedOutput class is used to describe a requested output tensor for an inference request.

Parameters:
  • name (str) – The name of output tensor to associate with this object

  • class_count (int) – The number of classifications to be requested. The default value is 0 which means the classification results are not requested.

_get_tensor()#

Retrieve the underlying InferRequestedOutputTensor message.

Returns:

The underlying InferRequestedOutputTensor protobuf message.

Return type:

protobuf message

name()#

Get the name of output associated with this object.

Returns:

The name of output

Return type:

str

set_shared_memory(region_name, byte_size, offset=0)#

Marks the output to return the inference result in the specified shared memory region.

Parameters:
  • region_name (str) – The name of the shared memory region to hold tensor data.

  • byte_size (int) – The size of the shared memory region to hold tensor data.

  • offset (int) – The offset, in bytes, into the region where the data for the tensor starts. The default value is 0.

Raises:

InferenceServerException – If failed to set shared memory for the tensor.

unset_shared_memory()#

Clears the shared memory option set by the last call to InferRequestedOutput.set_shared_memory(). After calling this function, the requested output will no longer be returned in a shared memory region.
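
For example, a sketch of requesting one raw output tensor and one top-3 classification result; the output names are illustrative.

import tritonclient.grpc as grpcclient

# Return "OUTPUT0" as a raw tensor.
raw_output = grpcclient.InferRequestedOutput("OUTPUT0")

# Return the top-3 classification results for "OUTPUT1".
topk_output = grpcclient.InferRequestedOutput("OUTPUT1", class_count=3)

# Pass both via the 'outputs' argument of InferenceServerClient.infer()/async_infer().
outputs = [raw_output, topk_output]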

class tritonclient.grpc.InferResult(result)#

An object of the InferResult class holds the response of an inference request and provides methods to retrieve inference results.

Parameters:

result (protobuf message) – The ModelInferResponse returned by the server

as_numpy(name)#

Get the tensor data for the output associated with this object, in numpy format.

Parameters:

name (str) – The name of the output tensor whose result is to be retrieved.

Returns:

The numpy array containing the response data for the tensor or None if the data for specified tensor name is not found.

Return type:

numpy array

get_output(name, as_json=False)#

Retrieves the InferOutputTensor corresponding to the named output.

Parameters:
  • name (str) – The name of the tensor for which the output is to be retrieved.

  • as_json (bool) – If True then returns response as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

Returns:

If an InferOutputTensor with the specified name is present in the ModelInferResponse, then it is returned as a protobuf message or dict; otherwise None is returned.

Return type:

protobuf message or dict

get_response(as_json=False)#

Retrieves the complete ModelInferResponse as a json dict object or protobuf message

Parameters:

as_json (bool) – If True then returns response as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

Returns:

The underlying ModelInferResponse as a protobuf message or dict.

Return type:

protobuf message or dict
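
For example, given an InferResult returned by InferenceServerClient.infer(); the output name "OUTPUT0" is illustrative.

# 'result' is an InferResult returned by a previous call to infer().
output_array = result.as_numpy("OUTPUT0")          # numpy array, or None if the name is unknown
output_tensor = result.get_output("OUTPUT0")        # InferOutputTensor protobuf message
response_dict = result.get_response(as_json=True)   # dict; int64 fields arrive as strings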

class tritonclient.grpc.InferenceServerClient(url, verbose=False, ssl=False, root_certificates=None, private_key=None, certificate_chain=None, creds=None, keepalive_options=None, channel_args=None)#

An InferenceServerClient object is used to perform any kind of communication with the InferenceServer using gRPC protocol. Most of the methods are thread-safe except start_stream, stop_stream and async_stream_infer. Accessing a client stream with different threads will cause undefined behavior.

Parameters:
  • url (str) – The inference server URL, e.g. ‘localhost:8001’.

  • verbose (bool) – If True generate verbose output. Default value is False.

  • ssl (bool) – If True use SSL encrypted secure channel. Default is False.

  • root_certificates (str) – File holding the PEM-encoded root certificates as a byte string, or None to retrieve them from a default location chosen by gRPC runtime. The option is ignored if ssl is False. Default is None.

  • private_key (str) – File holding the PEM-encoded private key as a byte string, or None if no private key should be used. The option is ignored if ssl is False. Default is None.

  • certificate_chain (str) – File holding PEM-encoded certificate chain as a byte string to use or None if no certificate chain should be used. The option is ignored if ssl is False. Default is None.

  • creds (grpc.ChannelCredentials) – A grpc.ChannelCredentials object to use for the connection. The ssl, root_certificates, private_key and certificate_chain options will be ignored when using this option. Default is None.

  • keepalive_options (KeepAliveOptions) – Object encapsulating various GRPC KeepAlive options. See the class definition for more information. Default is None.

  • channel_args (List[Tuple]) – List of Tuple pairs (“key”, value) to be passed directly to the GRPC channel as the channel_arguments. If this argument is provided, it is expected the channel arguments are correct and complete, and the keepalive_options parameter will be ignored since the corresponding keepalive channel arguments can be set directly in this parameter. See https://grpc.github.io/grpc/python/glossary.html#term-channel_arguments for more details. Default is None.

Raises:

Exception – If unable to create a client.
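
For example, a minimal sketch of creating clients over a plain channel and over an SSL-encrypted channel; the host name and certificate paths are placeholders.

import tritonclient.grpc as grpcclient

# Plain (insecure) channel to a local server.
client = grpcclient.InferenceServerClient(url="localhost:8001", verbose=False)

# SSL-encrypted channel with explicitly provided PEM files.
secure_client = grpcclient.InferenceServerClient(
    url="triton.example.com:8001",
    ssl=True,
    root_certificates="ca.pem",
    private_key="client.key",
    certificate_chain="client.pem",
)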

_get_metadata(headers)#

async_infer(model_name, inputs, callback, model_version='', outputs=None, request_id='', sequence_id=0, sequence_start=False, sequence_end=False, priority=0, timeout=None, client_timeout=None, headers=None, compression_algorithm=None, parameters=None)#

Run asynchronous inference using the supplied ‘inputs’ requesting the outputs specified by ‘outputs’.

Parameters:
  • model_name (str) – The name of the model to run inference.

  • inputs (list) – A list of InferInput objects, each describing data for an input tensor required by the model.

  • callback (function) – Python function that is invoked once the request is completed. The function must reserve the last two arguments (result, error) to hold InferResult and InferenceServerException objects respectively which will be provided to the function when executing the callback. The ownership of these objects will be given to the user. The ‘error’ would be None for a successful inference.

  • model_version (str) – The version of the model to run inference. The default value is an empty string, which means the server will choose a version based on the model and internal policy.

  • outputs (list) – A list of InferRequestedOutput objects, each describing how the output data must be returned. If not specified all outputs produced by the model will be returned using default settings.

  • request_id (str) – Optional identifier for the request. If specified will be returned in the response. Default value is an empty string which means no request_id will be used.

  • sequence_id (int) – The unique identifier for the sequence being represented by the object. Default value is 0 which means that the request does not belong to a sequence.

  • sequence_start (bool) – Indicates whether the request being added marks the start of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • sequence_end (bool) – Indicates whether the request being added marks the end of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • priority (int) – Indicates the priority of the request. Priority value zero indicates that the default priority level should be used (i.e. same behavior as not specifying the priority parameter). Lower value priorities indicate higher priority levels. Thus the highest priority level is indicated by setting the parameter to 1, the next highest is 2, etc. If not provided, the server will handle the request using default setting for the model.

  • timeout (int) – The timeout value for the request, in microseconds. If the request cannot be completed within this time, the server can take a model-specific action such as terminating the request. If not provided, the server will handle the request using the default setting for the model. This option is only respected by models configured with dynamic batching. See the dynamic batching documentation in the triton-inference-server/server repository for more details.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and provide an error with the message “Deadline Exceeded” in the callback when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • compression_algorithm (str) – Optional grpc compression algorithm to be used on client side. Currently supports “deflate”, “gzip” and None. By default, no compression is used.

  • parameters (dict) – Optional custom parameters to be included in the inference request.

Returns:

A representation of a computation in another control flow. Computations represented by a Future may be yet to be begun, ongoing, or have already completed.

Note

This object can be used to cancel the inference request like below:

>>> future = async_infer(...)
>>> ret = future.cancel()

Return type:

CallContext

Raises:

InferenceServerException – If server fails to issue inference.
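
For example, a sketch of an asynchronous request whose result is handed back through a queue; the model and tensor names are illustrative.

import queue
import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # 'error' is None on success, otherwise an InferenceServerException.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient("localhost:8001")
inp = grpcclient.InferInput("INPUT0", [1, 16], "FP32")
inp.set_data_from_numpy(np.zeros((1, 16), dtype=np.float32))

future = client.async_infer("my_model", [inp], callback)
item = responses.get()          # blocks until the callback runs
if isinstance(item, Exception):
    raise item
print(item.as_numpy("OUTPUT0"))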

async_stream_infer(model_name, inputs, model_version='', outputs=None, request_id='', sequence_id=0, sequence_start=False, sequence_end=False, enable_empty_final_response=False, priority=0, timeout=None, parameters=None)#

Runs an asynchronous inference over gRPC bi-directional streaming API. A stream must be established with a call to start_stream() before calling this function. All the results will be provided to the callback function associated with the stream.

Parameters:
  • model_name (str) – The name of the model to run inference.

  • inputs (list) – A list of InferInput objects, each describing data for an input tensor required by the model.

  • model_version (str) – The version of the model to run inference. The default value is an empty string, which means the server will choose a version based on the model and internal policy.

  • outputs (list) – A list of InferRequestedOutput objects, each describing how the output data must be returned. If not specified all outputs produced by the model will be returned using default settings.

  • request_id (str) – Optional identifier for the request. If specified will be returned in the response. Default value is an empty string which means no request_id will be used.

  • sequence_id (int or str) – The unique identifier for the sequence being represented by the object. A value of 0 or “” means that the request does not belong to a sequence. Default is 0.

  • sequence_start (bool) – Indicates whether the request being added marks the start of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0 or “”.

  • sequence_end (bool) – Indicates whether the request being added marks the end of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0 or “”.

  • enable_empty_final_response (bool) – Indicates whether “empty” responses should be generated and sent back to the client from the server during streaming inference when they contain the TRITONSERVER_RESPONSE_COMPLETE_FINAL flag. This strictly relates to the case of models/backends that send flags-only responses (use TRITONBACKEND_ResponseFactorySendFlags(TRITONSERVER_RESPONSE_COMPLETE_FINAL) or InferenceResponseSender.send(flags=TRITONSERVER_RESPONSE_COMPLETE_FINAL)) Currently, this only occurs for decoupled models, and can be used to communicate to the client when a request has received its final response from the model. If the backend sends the final flag along with a non-empty response, this arg is not needed. Default value is False.

  • priority (int) – Indicates the priority of the request. Priority value zero indicates that the default priority level should be used (i.e. same behavior as not specifying the priority parameter). Lower value priorities indicate higher priority levels. Thus the highest priority level is indicated by setting the parameter to 1, the next highest is 2, etc. If not provided, the server will handle the request using default setting for the model.

  • timeout (int) – The timeout value for the request, in microseconds. If the request cannot be completed within this time, the server can take a model-specific action such as terminating the request. If not provided, the server will handle the request using the default setting for the model. This does not stop the grpc stream itself and is only respected by models configured with dynamic batching. See the dynamic batching documentation in the triton-inference-server/server repository for more details.

  • parameters (dict) – Optional custom parameters to be included in the inference request.

Raises:

InferenceServerException – If server fails to issue inference.
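
For example, a sketch of a streaming session: a stream is opened with start_stream(), several requests are sent with async_stream_infer(), and the stream is closed with stop_stream(). Model and tensor names are illustrative.

import numpy as np
import tritonclient.grpc as grpcclient

def stream_callback(result, error):
    if error is not None:
        print("stream error:", error)
    else:
        print("got response for request", result.get_response().id)

client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=stream_callback)
try:
    for i in range(3):
        inp = grpcclient.InferInput("INPUT0", [1, 16], "FP32")
        inp.set_data_from_numpy(np.full((1, 16), i, dtype=np.float32))
        client.async_stream_infer(model_name="my_model", inputs=[inp], request_id=str(i))
finally:
    client.stop_stream()   # wait for pending responses, then close the stream
    client.close()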

close()#

Close the client. Any future calls to the server will result in an error.

get_cuda_shared_memory_status(region_name='', headers=None, as_json=False, client_timeout=None)#

Request cuda shared memory status from the server.

Parameters:
  • region_name (str) – The name of the region to query status. The default value is an empty string, which means that the status of all active cuda shared memory will be returned.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns cuda shared memory status as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

The JSON dict or CudaSharedMemoryStatusResponse message holding the cuda shared memory status.

Return type:

dict or protobuf message

Raises:

InferenceServerException – If unable to get the status of specified shared memory or has timed out.

get_inference_statistics(model_name='', model_version='', headers=None, as_json=False, client_timeout=None)#

Get the inference statistics for the specified model name and version.

Parameters:
  • model_name (str) – The name of the model to get statistics. The default value is an empty string, which means statistics of all models will be returned.

  • model_version (str) – The version of the model to get inference statistics. The default value is an empty string, which means the server will return the statistics of all available model versions.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns inference statistics as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Raises:

InferenceServerException – If unable to get the model inference statistics or has timed out.

get_log_settings(headers=None, as_json=False, client_timeout=None)#

Get the global log settings.

Parameters:
  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns log settings as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

The JSON dict or LogSettingsResponse message holding the log settings.

Return type:

dict or protobuf message

Raises:

InferenceServerException – If unable to get the log settings or has timed out.

get_model_config(model_name, model_version='', headers=None, as_json=False, client_timeout=None)#

Contact the inference server and get the configuration for the specified model.

Parameters:
  • model_name (str) – The name of the model

  • model_version (str) – The version of the model to get configuration. The default value is an empty string, which means the server will choose a version based on the model and internal policy.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns configuration as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

The JSON dict or ModelConfigResponse message holding the metadata.

Return type:

dict or protobuf message

Raises:

InferenceServerException – If unable to get model configuration or has timed out.

get_model_metadata(model_name, model_version='', headers=None, as_json=False, client_timeout=None)#

Contact the inference server and get the metadata for the specified model.

Parameters:
  • model_name (str) – The name of the model

  • model_version (str) – The version of the model to get metadata. The default value is an empty string, which means the server will choose a version based on the model and internal policy.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns model metadata as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

The JSON dict or ModelMetadataResponse message holding the metadata.

Return type:

dict or protobuf message

Raises:

InferenceServerException – If unable to get model metadata or has timed out.
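
For example, a sketch of reading model metadata as JSON; the model name is illustrative, and the field names follow the ModelMetadataResponse message. Note the int64-to-string conversion performed by MessageToJson.

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
metadata = client.get_model_metadata("my_model", as_json=True)
for model_input in metadata["inputs"]:
    # int64 fields such as shape dimensions arrive as strings from MessageToJson.
    dims = [int(d) for d in model_input["shape"]]
    print(model_input["name"], model_input["datatype"], dims)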

get_model_repository_index(headers=None, as_json=False, client_timeout=None)#

Get the index of model repository contents

Parameters:
  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns model repository index as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

The JSON dict or RepositoryIndexResponse message holding the model repository index.

Return type:

dict or protobuf message

get_server_metadata(headers=None, as_json=False, client_timeout=None)#

Contact the inference server and get its metadata.

Parameters:
  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns server metadata as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

The JSON dict or ServerMetadataResponse message holding the metadata.

Return type:

dict or protobuf message

Raises:

InferenceServerException – If unable to get server metadata or has timed out.

get_system_shared_memory_status(region_name='', headers=None, as_json=False, client_timeout=None)#

Request system shared memory status from the server.

Parameters:
  • region_name (str) – The name of the region to query status. The default value is an empty string, which means that the status of all active system shared memory will be returned.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns system shared memory status as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

The JSON dict or SystemSharedMemoryStatusResponse message holding the system shared memory status.

Return type:

dict or protobuf message

Raises:

InferenceServerException – If unable to get the status of specified shared memory or has timed out.

get_trace_settings(model_name=None, headers=None, as_json=False, client_timeout=None)#

Get the trace settings for the specified model name, or global trace settings if model name is not given

Parameters:
  • model_name (str) – The name of the model to get trace settings. Specifying None or empty string will return the global trace settings. The default value is None.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns trace settings as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

The JSON dict or TraceSettingResponse message holding the trace settings.

Return type:

dict or protobuf message

Raises:

InferenceServerException – If unable to get the trace settings or has timed out.

infer(model_name, inputs, model_version='', outputs=None, request_id='', sequence_id=0, sequence_start=False, sequence_end=False, priority=0, timeout=None, client_timeout=None, headers=None, compression_algorithm=None, parameters=None)#

Run synchronous inference using the supplied ‘inputs’ requesting the outputs specified by ‘outputs’.

Parameters:
  • model_name (str) – The name of the model to run inference.

  • inputs (list) – A list of InferInput objects, each describing data for an input tensor required by the model.

  • model_version (str) – The version of the model to run inference. The default value is an empty string, which means the server will choose a version based on the model and internal policy.

  • outputs (list) – A list of InferRequestedOutput objects, each describing how the output data must be returned. If not specified all outputs produced by the model will be returned using default settings.

  • request_id (str) – Optional identifier for the request. If specified will be returned in the response. Default value is an empty string which means no request_id will be used.

  • sequence_id (int) – The unique identifier for the sequence being represented by the object. Default value is 0 which means that the request does not belong to a sequence.

  • sequence_start (bool) – Indicates whether the request being added marks the start of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • sequence_end (bool) – Indicates whether the request being added marks the end of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • priority (int) – Indicates the priority of the request. Priority value zero indicates that the default priority level should be used (i.e. same behavior as not specifying the priority parameter). Lower value priorities indicate higher priority levels. Thus the highest priority level is indicated by setting the parameter to 1, the next highest is 2, etc. If not provided, the server will handle the request using default setting for the model.

  • timeout (int) – The timeout value for the request, in microseconds. If the request cannot be completed within this time, the server can take a model-specific action such as terminating the request. If not provided, the server will handle the request using the default setting for the model. This option is only respected by models configured with dynamic batching. See the dynamic batching documentation in the triton-inference-server/server repository for more details.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • compression_algorithm (str) – Optional grpc compression algorithm to be used on client side. Currently supports “deflate”, “gzip” and None. By default, no compression is used.

  • parameters (dict) – Optional custom parameters to be included in the inference request.

Returns:

The object holding the result of the inference.

Return type:

InferResult

Raises:

InferenceServerException – If server fails to perform inference.
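
For example, an end-to-end synchronous request against an assumed model "my_model" with a single 1x16 INT32 input "INPUT0" and output "OUTPUT0"; adjust the names, shapes and datatypes to your deployment.

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.grpc import InferenceServerException

client = grpcclient.InferenceServerClient("localhost:8001")

inp = grpcclient.InferInput("INPUT0", [1, 16], "INT32")
inp.set_data_from_numpy(np.arange(16, dtype=np.int32).reshape(1, 16))
out = grpcclient.InferRequestedOutput("OUTPUT0")

try:
    result = client.infer(model_name="my_model", inputs=[inp], outputs=[out], client_timeout=5.0)
    print(result.as_numpy("OUTPUT0"))
except InferenceServerException as e:
    print("inference failed:", e.message())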

is_model_ready(model_name, model_version='', headers=None, client_timeout=None)#

Contact the inference server and get the readiness of the specified model.

Parameters:
  • model_name (str) – The name of the model to check for readiness.

  • model_version (str) – The version of the model to check for readiness. The default value is an empty string, which means the server will choose a version based on the model and internal policy.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

True if the model is ready, False if not ready.

Return type:

bool

Raises:

InferenceServerException – If unable to get model readiness or has timed out.

is_server_live(headers=None, client_timeout=None)#

Contact the inference server and get liveness.

Parameters:
  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

True if server is live, False if server is not live.

Return type:

bool

Raises:

InferenceServerException – If unable to get liveness or has timed out.

is_server_ready(headers=None, client_timeout=None)#

Contact the inference server and get readiness.

Parameters:
  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

True if server is ready, False if server is not ready.

Return type:

bool

Raises:

InferenceServerException – If unable to get readiness or has timed out.
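
For example, a common startup check that combines the liveness and readiness probes; the model name is illustrative.

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")

if client.is_server_live() and client.is_server_ready():
    if client.is_model_ready("my_model"):
        print("server and model are ready to accept requests")
    else:
        print("server is up, but my_model is not ready yet")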

load_model(model_name, headers=None, config=None, files=None, client_timeout=None)#

Request the inference server to load or reload the specified model.

Parameters:
  • model_name (str) – The name of the model to be loaded.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • config (str) – Optional JSON representation of a model config provided for the load request. If provided, this config will be used for loading the model.

  • files (dict) – Optional dictionary specifying file path (with “file:” prefix) in the override model directory to the file content as bytes. The files will form the model directory that the model will be loaded from. If specified, ‘config’ must be provided to be the model configuration of the override model directory.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Raises:

InferenceServerException – If unable to load the model or has timed out.
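
For example, a sketch of reloading a model with an inline configuration override; the model name and config fields are only an illustration, and any valid model configuration JSON can be passed.

import json
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")

# Reload "my_model", overriding its configuration for this load.
override_config = json.dumps({"max_batch_size": 8})
client.load_model("my_model", config=override_config)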

register_cuda_shared_memory(name, raw_handle, device_id, byte_size, headers=None, client_timeout=None)#

Request the server to register a cuda shared memory region with the following specification.

Parameters:
  • name (str) – The name of the region to register.

  • raw_handle (bytes) – The raw serialized cudaIPC handle in base64 encoding.

  • device_id (int) – The GPU device ID on which the cudaIPC handle was created.

  • byte_size (int) – The size of the cuda shared memory region, in bytes.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Raises:

InferenceServerException – If unable to register the specified cuda shared memory or has timed out.

register_system_shared_memory(name, key, byte_size, offset=0, headers=None, client_timeout=None)#

Request the server to register a system shared memory region with the following specification.

Parameters:
  • name (str) – The name of the region to register.

  • key (str) – The key of the underlying memory object that contains the system shared memory region.

  • byte_size (int) – The size of the system shared memory region, in bytes.

  • offset (int) – Offset, in bytes, within the underlying memory object to the start of the system shared memory region. The default value is zero.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Raises:

InferenceServerException – If unable to register the specified system shared memory or has timed out.
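
For example, a sketch of wiring an input to a registered system shared memory region. It assumes a POSIX shared memory object with key "/input_data" has already been created and filled with the raw tensor bytes (for instance with the helpers in tritonclient.utils.shared_memory); tensor and region names are illustrative.

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
byte_size = 16 * 4  # 16 FP32 elements

# Make the existing /input_data object visible to the server under a region name.
client.register_system_shared_memory("input_region", "/input_data", byte_size)

# Point the input at the registered region instead of sending the data inline.
inp = grpcclient.InferInput("INPUT0", [1, 16], "FP32")
inp.set_shared_memory("input_region", byte_size)

# ... run client.infer(...) with 'inp' as usual, then clean up:
client.unregister_system_shared_memory("input_region")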

start_stream(callback, stream_timeout=None, headers=None, compression_algorithm=None)#

Starts a grpc bi-directional stream to send streaming inferences. Note: when using a stream, the user must ensure that InferenceServerClient.close() is called at exit.

Parameters:
  • callback (function) – Python function that is invoked upon receiving response from the underlying stream. The function must reserve the last two arguments (result, error) to hold InferResult and InferenceServerException objects respectively which will be provided to the function when executing the callback. The ownership of these objects will be given to the user. The ‘error’ would be None for a successful inference.

  • stream_timeout (float) – Optional stream timeout (in seconds). The stream will be closed once the specified timeout expires.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • compression_algorithm (str) – Optional grpc compression algorithm to be used on client side. Currently supports “deflate”, “gzip” and None. By default, no compression is used.

Raises:

InferenceServerException – If unable to start a stream or a stream was already running for this client or has timed out.

stop_stream(cancel_requests=False)#

Stops a stream if one is available.

Parameters:

cancel_requests (bool) – If set to True, the client cancels all pending requests and closes the stream. If set to False, the call blocks until all pending requests on the stream are processed.

unload_model(model_name, headers=None, unload_dependents=False, client_timeout=None)#

Request the inference server to unload the specified model.

Parameters:
  • model_name (str) – The name of the model to be unloaded.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • unload_dependents (bool) – Whether the dependents of the model should also be unloaded.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Raises:

InferenceServerException – If unable to unload the model or has timed out.

unregister_cuda_shared_memory(name='', headers=None, client_timeout=None)#

Request the server to unregister a cuda shared memory with the specified name.

Parameters:
  • name (str) – The name of the region to unregister. The default value is empty string which means all the cuda shared memory regions will be unregistered.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Raises:

InferenceServerException – If unable to unregister the specified cuda shared memory region or has timed out.

unregister_system_shared_memory(name='', headers=None, client_timeout=None)#

Request the server to unregister a system shared memory with the specified name.

Parameters:
  • name (str) – The name of the region to unregister. The default value is empty string which means all the system shared memory regions will be unregistered.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Raises:

InferenceServerException – If unable to unregister the specified system shared memory region or has timed out.

update_log_settings(settings, headers=None, as_json=False, client_timeout=None)#

Update the global log settings. Returns the log settings after the update.

Parameters:
  • settings (dict) – The new log setting values. Only the settings listed will be updated.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns log settings as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

The JSON dict or LogSettingsResponse message holding the updated log settings.

Return type:

dict or protobuf message

Raises:

InferenceServerException – If unable to update the log settings or has timed out.

update_trace_settings(model_name=None, settings={}, headers=None, as_json=False, client_timeout=None)#

Update the trace settings for the specified model name, or global trace settings if model name is not given. Returns the trace settings after the update.

Parameters:
  • model_name (str) – The name of the model to update trace settings. Specifying None or empty string will update the global trace settings. The default value is None.

  • settings (dict) – The new trace setting values. Only the settings listed will be updated. If a trace setting is listed in the dictionary with a value of ‘None’, that setting will be cleared.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns trace settings as a json dict, otherwise as a protobuf message. Default value is False. The returned json is generated from the protobuf message using MessageToJson and as a result int64 values are represented as strings. It is the caller’s responsibility to convert these strings back to int64 values as necessary.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None, which means the client will wait for the response from the server.

Returns:

The JSON dict or TraceSettingResponse message holding the updated trace settings.

Return type:

dict or protobuf message

Raises:

InferenceServerException – If unable to update the trace settings or has timed out.
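
For example, a sketch of enabling timestamp tracing for one model and reading the settings back; the setting names and values shown are assumptions and should be checked against the server's trace configuration documentation.

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")

# Setting names/values are assumptions; consult the server trace documentation.
client.update_trace_settings(
    model_name="my_model",
    settings={"trace_level": ["TIMESTAMPS"], "trace_rate": "100"},
)
print(client.get_trace_settings(model_name="my_model", as_json=True))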

class tritonclient.grpc.InferenceServerClientPlugin#

Every Triton Client Plugin should extend this class. Each plugin needs to implement the __call__() method.

abstract __call__(request)#

This method will be called when any of the client functions are invoked. Note that the request object must be modified in-place.

Parameters:

request (Request) – The request object.
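
For example, a sketch of a plugin that attaches an authorization header to every request; the register_plugin() call used to attach it to a client is an assumption not documented in this section (see also the tritonclient.grpc.auth module).

import tritonclient.grpc as grpcclient

class BearerTokenPlugin(grpcclient.InferenceServerClientPlugin):
    """Adds an authorization header to every outgoing request."""

    def __init__(self, token):
        self._token = token

    def __call__(self, request):
        # The Request object must be modified in place.
        request.headers["authorization"] = "Bearer " + self._token

client = grpcclient.InferenceServerClient("localhost:8001")
client.register_plugin(BearerTokenPlugin("my-secret-token"))  # registration API assumed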

exception tritonclient.grpc.InferenceServerException(msg, status=None, debug_details=None)#

Exception indicating non-Success status.

Parameters:
  • msg (str) – A brief description of error

  • status (str) – The error code

  • debug_details (str) – The additional details on the error

debug_details()#

Get the detailed information about the exception for debugging purposes

Returns:

Returns the exception details

Return type:

str

message()#

Get the exception message.

Returns:

The message associated with this exception, or None if no message.

Return type:

str

status()#

Get the status of the exception.

Returns:

Returns the status of the exception

Return type:

str

class tritonclient.grpc.KeepAliveOptions(keepalive_time_ms=2147483647, keepalive_timeout_ms=20000, keepalive_permit_without_calls=False, http2_max_pings_without_data=2)#

A KeepAliveOptions object is used to encapsulate GRPC KeepAlive related parameters for initializing an InferenceServerClient object.

See the grpc/grpc documentation for more information.

Parameters:
  • keepalive_time_ms (int) – The period (in milliseconds) after which a keepalive ping is sent on the transport. Default is INT32_MAX.

  • keepalive_timeout_ms (int) – The period (in milliseconds) the sender of the keepalive ping waits for an acknowledgement. If it does not receive an acknowledgment within this time, it will close the connection. Default is 20000 (20 seconds).

  • keepalive_permit_without_calls (bool) – Allows keepalive pings to be sent even if there are no calls in flight. Default is False.

  • http2_max_pings_without_data (int) – The maximum number of pings that can be sent when there is no data/header frame to be sent. gRPC Core will not continue sending pings if we run over the limit. Setting it to 0 allows sending pings without such a restriction. Default is 2.
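
For example, a sketch of tightening the keepalive behaviour for long-lived idle connections.

import tritonclient.grpc as grpcclient

keepalive = grpcclient.KeepAliveOptions(
    keepalive_time_ms=30000,              # send a ping every 30 seconds
    keepalive_timeout_ms=10000,           # wait up to 10 seconds for the ack
    keepalive_permit_without_calls=True,  # ping even when no calls are in flight
    http2_max_pings_without_data=0,       # no cap on pings without data frames
)

client = grpcclient.InferenceServerClient("localhost:8001", keepalive_options=keepalive)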

class tritonclient.grpc.Request(headers)#

A request object.

Parameters:

headers (dict) – A dictionary containing the request headers.

Modules

tritonclient.grpc.aio

tritonclient.grpc.auth