Python API

GRPC Client

This module contains the GRPC client including the ability to send health, status, metadata and inference requests to a Triton server.

class tritongrpcclient.InferInput(name, shape, datatype)

An object of InferInput class is used to describe an input tensor for an inference request.

Parameters
  • name (str) – The name of input whose data will be described by this object

  • shape (list) – The shape of the associated input.

  • datatype (str) – The datatype of the associated input.

datatype()

Get the datatype of input associated with this object.

Returns

The datatype of input

Return type

str

name()

Get the name of input associated with this object.

Returns

The name of input

Return type

str

set_data_from_numpy(input_tensor)

Set the tensor data from the specified numpy array for input associated with this object.

Parameters

input_tensor (numpy array) – The tensor data in numpy array format

Raises

InferenceServerException – If failed to set data for the tensor.

set_shape(shape)

Set the shape of input.

Parameters

shape (list) – The shape of the associated input.

set_shared_memory(region_name, byte_size, offset=0)

Set the tensor data from the specified shared memory region.

Parameters
  • region_name (str) – The name of the shared memory region holding tensor data.

  • byte_size (int) – The size of the shared memory region holding tensor data.

  • offset (int) – The offset, in bytes, into the region where the data for the tensor starts. The default value is 0.

shape()

Get the shape of input associated with this object.

Returns

The shape of input

Return type

list
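
The following is a minimal sketch of preparing an input with this class; the tensor name "INPUT0", the [1, 16] shape and the FP32 datatype are assumptions about a hypothetical model.

    import numpy as np
    import tritongrpcclient

    # Describe the input tensor expected by the (hypothetical) model.
    input0 = tritongrpcclient.InferInput("INPUT0", [1, 16], "FP32")

    # Attach the data as a numpy array matching the declared shape and datatype.
    input0_data = np.random.rand(1, 16).astype(np.float32)
    input0.set_data_from_numpy(input0_data)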

class tritongrpcclient.InferRequestedOutput(name, class_count=0)

An object of InferRequestedOutput class is used to describe a requested output tensor for an inference request.

Parameters
  • name (str) – The name of output tensor to associate with this object

  • class_count (int) – The number of classifications to be requested. The default value is 0 which means the classification results are not requested.

name()

Get the name of output associated with this object.

Returns

The name of output

Return type

str

set_shared_memory(region_name, byte_size, offset=0)

Marks the output to return the inference result in the specified shared memory region.

Parameters
  • region_name (str) – The name of the shared memory region to hold tensor data.

  • byte_size (int) – The size of the shared memory region to hold tensor data.

  • offset (int) – The offset, in bytes, into the region where the data for the tensor starts. The default value is 0.

Raises

InferenceServerException – If failed to set shared memory for the tensor.

unset_shared_memory()

Clears the shared memory option set by the last call to InferRequestedOutput.set_shared_memory(). After calling this function the requested output will no longer be returned in a shared memory region.
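
A minimal sketch of describing requested outputs; the output names "OUTPUT0" and "PROBS" are assumptions about a hypothetical model.

    import tritongrpcclient

    # Return the raw tensor data for OUTPUT0.
    output0 = tritongrpcclient.InferRequestedOutput("OUTPUT0")

    # Return the top-3 classification results for PROBS instead of the raw tensor.
    probs = tritongrpcclient.InferRequestedOutput("PROBS", class_count=3)

    outputs = [output0, probs]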

class tritongrpcclient.InferResult(result)

An object of InferResult class holds the response of an inference request and provides methods to retrieve inference results.

Parameters

result (protobuf message) – The ModelInferResponse returned by the server

as_numpy(name)

Get the tensor data for output associated with this object in numpy format

Parameters

name (str) – The name of the output tensor whose result is to be retrieved.

Returns

The numpy array containing the response data for the tensor or None if the data for specified tensor name is not found.

Return type

numpy array

get_output(name, as_json=False)

Retrieves the InferOutputTensor corresponding to the named output.

Parameters
  • name (str) – The name of the tensor for which Output is to be retrieved.

  • as_json (bool) – If True then returns response as a json dict, otherwise as a protobuf message. Default value is False.

Returns

If an InferOutputTensor with the specified name is present in ModelInferResponse then returns it as a protobuf message or dict, otherwise returns None.

Return type

protobuf message or dict

get_response(as_json=False)

Retrieves the complete ModelInferResponse as a json dict object or protobuf message

Parameters

as_json (bool) – If True then returns response as a json dict, otherwise as a protobuf message. Default value is False.

Returns

The underlying ModelInferResponse as a protobuf message or dict.

Return type

protobuf message or dict

class tritongrpcclient.InferenceServerClient(url, verbose=False, ssl=False, root_certificates=None, private_key=None, certificate_chain=None)

An InferenceServerClient object is used to perform any kind of communication with the InferenceServer using gRPC protocol. Most of the methods are thread-safe except start_stream, stop_stream and async_stream_infer. Accessing a client stream with different threads will cause undefined behavior.

Parameters
  • url (str) – The inference server URL, e.g. ‘localhost:8001’.

  • verbose (bool) – If True generate verbose output. Default value is False.

  • ssl (bool) – If True use SSL encrypted secure channel. Default is False.

  • root_certificates (str) – File holding the PEM-encoded root certificates as a byte string, or None to retrieve them from a default location chosen by gRPC runtime. The option is ignored if ssl is False. Default is None.

  • private_key (str) – File holding the PEM-encoded private key as a byte string, or None if no private key should be used. The option is ignored if ssl is False. Default is None.

  • certificate_chain (str) – File holding PEM-encoded certificate chain as a byte string to use or None if no certificate chain should be used. The option is ignored if ssl is False. Default is None.

Raises

Exception – If unable to create a client.

async_infer(model_name, inputs, callback, model_version='', outputs=None, request_id='', sequence_id=0, sequence_start=False, sequence_end=False, priority=0, timeout=None, client_timeout=None, headers=None)

Run asynchronous inference using the supplied ‘inputs’ requesting the outputs specified by ‘outputs’.

Parameters
  • model_name (str) – The name of the model to run inference.

  • inputs (list) – A list of InferInput objects, each describing data for an input tensor required by the model.

  • callback (function) – Python function that is invoked once the request is completed. The function must reserve the last two arguments (result, error) to hold InferResult and InferenceServerException objects respectively which will be provided to the function when executing the callback. The ownership of these objects will be given to the user. The ‘error’ would be None for a successful inference.

  • model_version (str) – The version of the model to run inference. The default value is an empty string which means the server will choose a version based on the model and internal policy.

  • outputs (list) – A list of InferRequestedOutput objects, each describing how the output data must be returned. If not specified all outputs produced by the model will be returned using default settings.

  • request_id (str) – Optional identifier for the request. If specified will be returned in the response. Default value is an empty string which means no request_id will be used.

  • sequence_id (int) – The unique identifier for the sequence being represented by the object. Default value is 0 which means that the request does not belong to a sequence.

  • sequence_start (bool) – Indicates whether the request being added marks the start of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • sequence_end (bool) – Indicates whether the request being added marks the end of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • priority (int) – Indicates the priority of the request. Priority value zero indicates that the default priority level should be used (i.e. same behavior as not specifying the priority parameter). Lower value priorities indicate higher priority levels. Thus the highest priority level is indicated by setting the parameter to 1, the next highest is 2, etc. If not provided, the server will handle the request using default setting for the model.

  • timeout (int) – The timeout value for the request, in microseconds. If the request cannot be completed within the time the server can take a model-specific action such as terminating the request. If not provided, the server will handle the request using default setting for the model.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and provide an error with the message “Deadline Exceeded” in the callback when the specified time elapses. The default value is None which means the client will wait for the response from the server.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

Raises

InferenceServerException – If server fails to issue inference.
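
A minimal sketch of an asynchronous request, assuming a server at localhost:8001 and a hypothetical model "simple" with an FP32 input "INPUT0" of shape [1, 16] and an output "OUTPUT0".

    from functools import partial
    import queue

    import numpy as np
    import tritongrpcclient

    def callback(response_queue, result, error):
        # The last two arguments are reserved for the InferResult and
        # InferenceServerException objects; 'error' is None on success.
        response_queue.put((result, error))

    client = tritongrpcclient.InferenceServerClient("localhost:8001")

    inputs = [tritongrpcclient.InferInput("INPUT0", [1, 16], "FP32")]
    inputs[0].set_data_from_numpy(np.ones((1, 16), dtype=np.float32))

    responses = queue.Queue()
    client.async_infer("simple", inputs, partial(callback, responses))

    result, error = responses.get()
    if error is None:
        print(result.as_numpy("OUTPUT0"))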

async_stream_infer(model_name, inputs, model_version='', outputs=None, request_id='', sequence_id=0, sequence_start=False, sequence_end=False, priority=0, timeout=None)

Runs an asynchronous inference over gRPC bi-directional streaming API. A stream must be established with a call to start_stream() before calling this function. All the results will be provided to the callback function associated with the stream.

Parameters
  • model_name (str) – The name of the model to run inference.

  • inputs (list) – A list of InferInput objects, each describing data for an input tensor required by the model.

  • model_version (str) – The version of the model to run inference. The default value is an empty string which means the server will choose a version based on the model and internal policy.

  • outputs (list) – A list of InferRequestedOutput objects, each describing how the output data must be returned. If not specified all outputs produced by the model will be returned using default settings.

  • request_id (str) – Optional identifier for the request. If specified will be returned in the response. Default value is an empty string which means no request_id will be used.

  • sequence_id (int) – The unique identifier for the sequence being represented by the object. Default value is 0 which means that the request does not belong to a sequence.

  • sequence_start (bool) – Indicates whether the request being added marks the start of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • sequence_end (bool) – Indicates whether the request being added marks the end of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • priority (int) – Indicates the priority of the request. Priority value zero indicates that the default priority level should be used (i.e. same behavior as not specifying the priority parameter). Lower value priorities indicate higher priority levels. Thus the highest priority level is indicated by setting the parameter to 1, the next highest is 2, etc. If not provided, the server will handle the request using default setting for the model.

  • timeout (int) – The timeout value for the request, in microseconds. If the request cannot be completed within the time the server can take a model-specific action such as terminating the request. If not provided, the server will handle the request using default setting for the model.

Raises

InferenceServerException – If server fails to issue inference.

close()

Close the client. Any future calls to the server will result in an error.

get_cuda_shared_memory_status(region_name='', headers=None, as_json=False)

Request cuda shared memory status from the server.

Parameters
  • region_name (str) – The name of the region to query status. The default value is an empty string, which means that the status of all active cuda shared memory will be returned.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns cuda shared memory status as a json dict, otherwise as a protobuf message. Default value is False.

Returns

The JSON dict or CudaSharedMemoryStatusResponse message holding the cuda shared memory status.

Return type

dict or protobuf message

Raises

InferenceServerException – If unable to get the status of specified shared memory.

get_inference_statistics(model_name='', model_version='', headers=None, as_json=False)

Get the inference statistics for the specified model name and version.

Parameters
  • model_name (str) – The name of the model to get statistics. The default value is an empty string, which means statistics of all models will be returned.

  • model_version (str) – The version of the model to get inference statistics. The default value is an empty string which means the server will return the statistics of all available model versions.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns inference statistics as a json dict, otherwise as a protobuf message. Default value is False.

Raises

InferenceServerException – If unable to get the model inference statistics.

get_model_config(model_name, model_version='', headers=None, as_json=False)

Contact the inference server and get the configuration for specified model.

Parameters
  • model_name (str) – The name of the model

  • model_version (str) – The version of the model to get configuration. The default value is an empty string which means the server will choose a version based on the model and internal policy.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns configuration as a json dict, otherwise as a protobuf message. Default value is False.

Returns

The JSON dict or ModelConfigResponse message holding the metadata.

Return type

dict or protobuf message

Raises

InferenceServerException – If unable to get model configuration.

get_model_metadata(model_name, model_version='', headers=None, as_json=False)

Contact the inference server and get the metadata for specified model.

Parameters
  • model_name (str) – The name of the model

  • model_version (str) – The version of the model to get metadata. The default value is an empty string which means the server will choose a version based on the model and internal policy.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns model metadata as a json dict, otherwise as a protobuf message. Default value is False.

Returns

The JSON dict or ModelMetadataResponse message holding the metadata.

Return type

dict or protobuf message

Raises

InferenceServerException – If unable to get model metadata.
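
A minimal sketch of fetching model metadata and configuration, assuming a server at localhost:8001 serving a hypothetical model named "simple".

    import tritongrpcclient

    client = tritongrpcclient.InferenceServerClient("localhost:8001")

    # As protobuf messages (the default)...
    metadata = client.get_model_metadata("simple")
    config = client.get_model_config("simple")

    # ...or as plain JSON dicts for easier inspection.
    metadata_json = client.get_model_metadata("simple", as_json=True)
    print(metadata_json["name"])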

get_model_repository_index(headers=None, as_json=False)

Get the index of model repository contents

Parameters
  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns model repository index as a json dict, otherwise as a protobuf message. Default value is False.

Returns

The JSON dict or RepositoryIndexResponse message holding the model repository index.

Return type

dict or protobuf message

get_server_metadata(headers=None, as_json=False)

Contact the inference server and get its metadata.

Parameters
  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns server metadata as a json dict, otherwise as a protobuf message. Default value is False.

Returns

The JSON dict or ServerMetadataResponse message holding the metadata.

Return type

dict or protobuf message

Raises

InferenceServerException – If unable to get server metadata.

get_system_shared_memory_status(region_name='', headers=None, as_json=False)

Request system shared memory status from the server.

Parameters
  • region_name (str) – The name of the region to query status. The default value is an empty string, which means that the status of all active system shared memory will be returned.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • as_json (bool) – If True then returns system shared memory status as a json dict, otherwise as a protobuf message. Default value is False.

Returns

The JSON dict or SystemSharedMemoryStatusResponse message holding the system shared memory status.

Return type

dict or protobuf message

Raises

InferenceServerException – If unable to get the status of specified shared memory.

infer(model_name, inputs, model_version='', outputs=None, request_id='', sequence_id=0, sequence_start=False, sequence_end=False, priority=0, timeout=None, client_timeout=None, headers=None)

Run synchronous inference using the supplied ‘inputs’ requesting the outputs specified by ‘outputs’.

Parameters
  • model_name (str) – The name of the model to run inference.

  • inputs (list) – A list of InferInput objects, each describing data for an input tensor required by the model.

  • model_version (str) – The version of the model to run inference. The default value is an empty string which means the server will choose a version based on the model and internal policy.

  • outputs (list) – A list of InferRequestedOutput objects, each describing how the output data must be returned. If not specified all outputs produced by the model will be returned using default settings.

  • request_id (str) – Optional identifier for the request. If specified will be returned in the response. Default value is an empty string which means no request_id will be used.

  • sequence_id (int) – The unique identifier for the sequence being represented by the object. Default value is 0 which means that the request does not belong to a sequence.

  • sequence_start (bool) – Indicates whether the request being added marks the start of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • sequence_end (bool) – Indicates whether the request being added marks the end of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • priority (int) – Indicates the priority of the request. Priority value zero indicates that the default priority level should be used (i.e. same behavior as not specifying the priority parameter). Lower value priorities indicate higher priority levels. Thus the highest priority level is indicated by setting the parameter to 1, the next highest is 2, etc. If not provided, the server will handle the request using default setting for the model.

  • timeout (int) – The timeout value for the request, in microseconds. If the request cannot be completed within the time the server can take a model-specific action such as terminating the request. If not provided, the server will handle the request using default setting for the model.

  • client_timeout (float) – The maximum end-to-end time, in seconds, the request is allowed to take. The client will abort the request and raise an InferenceServerException with the message “Deadline Exceeded” when the specified time elapses. The default value is None which means the client will wait for the response from the server.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

Returns

The object holding the result of the inference.

Return type

InferResult

Raises

InferenceServerException – If server fails to perform inference.
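
A minimal end-to-end sketch of a synchronous request, assuming a server at localhost:8001 and a hypothetical model "simple" with an FP32 input "INPUT0" of shape [1, 16] and an output "OUTPUT0".

    import numpy as np
    import tritongrpcclient

    client = tritongrpcclient.InferenceServerClient("localhost:8001")

    inputs = [tritongrpcclient.InferInput("INPUT0", [1, 16], "FP32")]
    inputs[0].set_data_from_numpy(np.ones((1, 16), dtype=np.float32))
    outputs = [tritongrpcclient.InferRequestedOutput("OUTPUT0")]

    # Abort client-side if no response arrives within 5 seconds.
    result = client.infer("simple", inputs, outputs=outputs, client_timeout=5.0)
    print(result.as_numpy("OUTPUT0"))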

is_model_ready(model_name, model_version='', headers=None)

Contact the inference server and get the readiness of specified model.

Parameters
  • model_name (str) – The name of the model to check for readiness.

  • model_version (str) – The version of the model to check for readiness. The default value is an empty string which means the server will choose a version based on the model and internal policy.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

Returns

True if the model is ready, False if not ready.

Return type

bool

Raises

InferenceServerException – If unable to get model readiness.

is_server_live(headers=None)

Contact the inference server and get liveness.

Parameters

headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

Returns

True if server is live, False if server is not live.

Return type

bool

Raises

InferenceServerException – If unable to get liveness.

is_server_ready(headers=None)

Contact the inference server and get readiness.

Parameters

headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

Returns

True if server is ready, False if server is not ready.

Return type

bool

Raises

InferenceServerException – If unable to get readiness.
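
A minimal sketch of the liveness and readiness checks, assuming a server at localhost:8001 serving a hypothetical model named "simple".

    import tritongrpcclient

    client = tritongrpcclient.InferenceServerClient("localhost:8001")

    if client.is_server_live() and client.is_server_ready():
        print("server is up")

    if client.is_model_ready("simple"):
        print("model 'simple' is ready")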

load_model(model_name, headers=None)

Request the inference server to load or reload specified model.

Parameters
  • model_name (str) – The name of the model to be loaded.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

Raises

InferenceServerException – If unable to load the model.

register_cuda_shared_memory(name, raw_handle, device_id, byte_size, headers=None)

Request the server to register a CUDA shared memory region with the following specification.

Parameters
  • name (str) – The name of the region to register.

  • raw_handle (bytes) – The raw serialized cudaIPC handle in base64 encoding.

  • device_id (int) – The GPU device ID on which the cudaIPC handle was created.

  • byte_size (int) – The size of the cuda shared memory region, in bytes.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

Raises

InferenceServerException – If unable to register the specified cuda shared memory.

register_system_shared_memory(name, key, byte_size, offset=0, headers=None)

Request the server to register a system shared memory region with the following specification.

Parameters
  • name (str) – The name of the region to register.

  • key (str) – The key of the underlying memory object that contains the system shared memory region.

  • byte_size (int) – The size of the system shared memory region, in bytes.

  • offset (int) – Offset, in bytes, within the underlying memory object to the start of the system shared memory region. The default value is zero.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

Raises

InferenceServerException – If unable to register the specified system shared memory.

start_stream(callback, stream_timeout=None, headers=None)

Starts a grpc bi-directional stream to send streaming inferences. Note: when using a stream, the user must ensure that InferenceServerClient.close() is called at exit.

Parameters
  • callback (function) – Python function that is invoked upon receiving response from the underlying stream. The function must reserve the last two arguments (result, error) to hold InferResult and InferenceServerException objects respectively which will be provided to the function when executing the callback. The ownership of these objects will be given to the user. The ‘error’ would be None for a successful inference.

  • stream_timeout (float) – Optional stream timeout. The stream will be closed once the specified timeout expires.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

Raises

InferenceServerException – If unable to start a stream or a stream was already running for this client.

stop_stream()

Stops a stream if one is available.
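
A minimal sketch of streaming inference, assuming a server at localhost:8001 and a hypothetical model "simple" with an FP32 input "INPUT0" of shape [1, 16] and an output "OUTPUT0".

    import queue

    import numpy as np
    import tritongrpcclient

    responses = queue.Queue()

    def stream_callback(result, error):
        # 'error' is None for a successful inference.
        responses.put((result, error))

    client = tritongrpcclient.InferenceServerClient("localhost:8001")
    client.start_stream(callback=stream_callback)

    inputs = [tritongrpcclient.InferInput("INPUT0", [1, 16], "FP32")]
    for i in range(4):
        inputs[0].set_data_from_numpy(np.full((1, 16), i, dtype=np.float32))
        client.async_stream_infer("simple", inputs, request_id=str(i))

    for _ in range(4):
        result, error = responses.get()
        if error is None:
            print(result.as_numpy("OUTPUT0"))

    client.stop_stream()
    client.close()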

unload_model(model_name, headers=None)

Request the inference server to unload specified model.

Parameters
  • model_name (str) – The name of the model to be unloaded.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

Raises

InferenceServerException – If unable to unload the model.

unregister_cuda_shared_memory(name='', headers=None)

Request the server to unregister a cuda shared memory with the specified name.

Parameters
  • name (str) – The name of the region to unregister. The default value is empty string which means all the cuda shared memory regions will be unregistered.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

Raises

InferenceServerException – If unable to unregister the specified cuda shared memory region.

unregister_system_shared_memory(name='', headers=None)

Request the server to unregister a system shared memory with the specified name.

Parameters
  • name (str) – The name of the region to unregister. The default value is empty string which means all the system shared memory regions will be unregistered.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

Raises

InferenceServerException – If unable to unregister the specified system shared memory region.

HTTP/REST Client

This module contains the HTTP/REST client including the ability to send health, status, metadata and inference requests to a Triton server.

class tritonhttpclient.InferAsyncRequest(greenlet, verbose=False)

An object of InferAsyncRequest class is used to describe a handle to an ongoing asynchronous inference request.

Parameters
  • greenlet (gevent.Greenlet) – The greenlet object which will provide the results. For further details about greenlets refer to http://www.gevent.org/api/gevent.greenlet.html.

  • verbose (bool) – If True generate verbose output. Default value is False.

get_result(block=True, timeout=None)

Get the results of the associated asynchronous inference.

Parameters
  • block (bool) – If block is True, the function will wait until the corresponding response is received from the server. Default value is True.

  • timeout (int) – The maximum wait time for the function. This setting is ignored if block is set to False. Default is None, which means the function will block indefinitely until the corresponding response is received.

Returns

The object holding the result of the async inference.

Return type

InferResult

Raises

InferenceServerException – If server fails to perform inference or failed to respond within specified timeout.

class tritonhttpclient.InferInput(name, shape, datatype)

An object of InferInput class is used to describe an input tensor for an inference request.

Parameters
  • name (str) – The name of input whose data will be described by this object

  • shape (list) – The shape of the associated input.

  • datatype (str) – The datatype of the associated input.

datatype()

Get the datatype of input associated with this object.

Returns

The datatype of input

Return type

str

name()

Get the name of input associated with this object.

Returns

The name of input

Return type

str

set_data_from_numpy(input_tensor, binary_data=True)

Set the tensor data from the specified numpy array for input associated with this object.

Parameters
  • input_tensor (numpy array) – The tensor data in numpy array format

  • binary_data (bool) – Indicates whether to set data for the input in binary format or explicit tensor within JSON. The default value is True, which means the data will be delivered as binary data in the HTTP body after the JSON object.

Raises

InferenceServerException – If failed to set data for the tensor.

set_shape(shape)

Set the shape of input.

Parameters

shape (list) – The shape of the associated input.

set_shared_memory(region_name, byte_size, offset=0)

Set the tensor data from the specified shared memory region.

Parameters
  • region_name (str) – The name of the shared memory region holding tensor data.

  • byte_size (int) – The size of the shared memory region holding tensor data.

  • offset (int) – The offset, in bytes, into the region where the data for the tensor starts. The default value is 0.

shape()

Get the shape of input associated with this object.

Returns

The shape of input

Return type

list

class tritonhttpclient.InferRequestedOutput(name, binary_data=True, class_count=0)

An object of InferRequestedOutput class is used to describe a requested output tensor for an inference request.

Parameters
  • name (str) – The name of output tensor to associate with this object.

  • binary_data (bool) – Indicates whether to return result data for the output in binary format or explicit tensor within JSON. The default value is True, which means the data will be delivered as binary data in the HTTP body after JSON object. This field will be unset if shared memory is set for the output.

  • class_count (int) – The number of classifications to be requested. The default value is 0 which means the classification results are not requested.

name()

Get the name of output associated with this object.

Returns

The name of output

Return type

str

set_shared_memory(region_name, byte_size, offset=0)

Marks the output to return the inference result in the specified shared memory region.

Parameters
  • region_name (str) – The name of the shared memory region to hold tensor data.

  • byte_size (int) – The size of the shared memory region to hold tensor data.

  • offset (int) – The offset, in bytes, into the region where the data for the tensor starts. The default value is 0.

unset_shared_memory()

Clears the shared memory option set by the last call to InferRequestedOutput.set_shared_memory(). After calling this function the requested output will no longer be returned in a shared memory region.

class tritonhttpclient.InferResult(response, verbose)

An object of InferResult class holds the response of an inference request and provides methods to retrieve inference results.

Parameters
  • response (dict) – The inference response from the server

  • verbose (bool) – If True generate verbose output. Default value is False.

as_numpy(name)

Get the tensor data for output associated with this object in numpy format

Parameters

name (str) – The name of the output tensor whose result is to be retrieved.

Returns

The numpy array containing the response data for the tensor or None if the data for specified tensor name is not found.

Return type

numpy array

get_output(name)

Retrieves the output tensor corresponding to the named output.

Parameters

name (str) – The name of the tensor for which Output is to be retrieved.

Returns

If an output tensor with the specified name is present in the infer response then returns it as a json dict, otherwise returns None.

Return type

dict

get_response()

Retrieves the complete response

Returns

The underlying response dict.

Return type

dict

class tritonhttpclient.InferenceServerClient(url, verbose=False, concurrency=1, connection_timeout=60.0, network_timeout=60.0, max_greenlets=None, ssl=False, ssl_options=None, ssl_context_factory=None, insecure=False)

An InferenceServerClient object is used to perform any kind of communication with the InferenceServer using http protocol. None of the methods are thread safe. The object is intended to be used by a single thread and simultaneously calling different methods with different threads is not supported and will cause undefined behavior.

Parameters
  • url (str) – The inference server URL, e.g. ‘localhost:8000’.

  • verbose (bool) – If True generate verbose output. Default value is False.

  • concurrency (int) – The number of connections to create for this client. Default value is 1.

  • connection_timeout (float) – The timeout value for the connection. Default value is 60.0 sec.

  • network_timeout (float) – The timeout value for the network. Default value is 60.0 sec

  • max_greenlets (int) – Determines the maximum allowed number of worker greenlets for handling asynchronous inference requests. Default value is None, which means there will be no restriction on the number of greenlets created.

  • ssl (bool) – If True, channels the requests to encrypted https scheme. Default value is False.

  • ssl_options (dict) – Any options supported by ssl.wrap_socket, specified as a dictionary. The argument is ignored if ‘ssl’ is False.

  • ssl_context_factory (SSLContext callable) – It must be a callable that returns an SSLContext. The default value is None, which uses ssl.create_default_context. The argument is ignored if ‘ssl’ is False.

  • insecure (bool) – If True, the host name will not be matched against the certificate. Default value is False. The argument is ignored if ‘ssl’ is False.

Raises

Exception – If unable to create a client.

async_infer(model_name, inputs, model_version='', outputs=None, request_id='', sequence_id=0, sequence_start=False, sequence_end=False, priority=0, timeout=None, headers=None, query_params=None)

Run asynchronous inference using the supplied ‘inputs’ requesting the outputs specified by ‘outputs’. Even though this call is non-blocking, the actual number of concurrent requests to the server is limited by the ‘concurrency’ parameter specified when creating this client. In other words, if the number of in-flight async_infer requests exceeds the specified ‘concurrency’, delivery of the excess request(s) to the server will be blocked until a slot is made available by retrieving the results of previously issued requests.

Parameters
  • model_name (str) – The name of the model to run inference.

  • inputs (list) – A list of InferInput objects, each describing data for an input tensor required by the model.

  • model_version (str) – The version of the model to run inference. The default value is an empty string which means the server will choose a version based on the model and internal policy.

  • outputs (list) – A list of InferRequestedOutput objects, each describing how the output data must be returned. If not specified all outputs produced by the model will be returned using default settings.

  • request_id (str) – Optional identifier for the request. If specified will be returned in the response. Default value is an empty string which means no request_id will be used.

  • sequence_id (int) – The unique identifier for the sequence being represented by the object. Default value is 0 which means that the request does not belong to a sequence.

  • sequence_start (bool) – Indicates whether the request being added marks the start of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • sequence_end (bool) – Indicates whether the request being added marks the end of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • priority (int) – Indicates the priority of the request. Priority value zero indicates that the default priority level should be used (i.e. same behavior as not specifying the priority parameter). Lower value priorities indicate higher priority levels. Thus the highest priority level is indicated by setting the parameter to 1, the next highest is 2, etc. If not provided, the server will handle the request using default setting for the model.

  • timeout (int) – The timeout value for the request, in microseconds. If the request cannot be completed within the time the server can take a model-specific action such as terminating the request. If not provided, the server will handle the request using default setting for the model.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request

  • query_params (dict) – Optional url query parameters to use in network transaction.

Returns

The handle to the asynchronous inference request.

Return type

InferAsyncRequest object

Raises

InferenceServerException – If server fails to issue inference.
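
A minimal sketch of asynchronous HTTP inference, assuming a server at localhost:8000 and a hypothetical model "simple" with an FP32 input "INPUT0" of shape [1, 16] and an output "OUTPUT0".

    import numpy as np
    import tritonhttpclient

    # 'concurrency' bounds how many requests can actually be in flight at once.
    client = tritonhttpclient.InferenceServerClient("localhost:8000", concurrency=4)

    inputs = [tritonhttpclient.InferInput("INPUT0", [1, 16], "FP32")]
    inputs[0].set_data_from_numpy(np.ones((1, 16), dtype=np.float32))

    # Issue several requests without waiting, then collect the handles.
    handles = [client.async_infer("simple", inputs) for _ in range(4)]

    for handle in handles:
        result = handle.get_result()  # blocks until this response arrives
        print(result.as_numpy("OUTPUT0"))

    client.close()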

close()

Close the client. Any future calls to the server will result in an error.

get_cuda_shared_memory_status(region_name='', headers=None, query_params=None)

Request cuda shared memory status from the server.

Parameters
  • region_name (str) – The name of the region to query status. The default value is an empty string, which means that the status of all active cuda shared memory will be returned.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request

  • query_params (dict) – Optional url query parameters to use in network transaction

Returns

The JSON dict holding cuda shared memory status.

Return type

dict

Raises

InferenceServerException – If unable to get the status of specified shared memory.

get_inference_statistics(model_name='', model_version='', headers=None, query_params=None)

Get the inference statistics for the specified model name and version.

Parameters
  • model_name (str) – The name of the model to get statistics. The default value is an empty string, which means statistics of all models will be returned.

  • model_version (str) – The version of the model to get inference statistics. The default value is an empty string which means the server will return the statistics of all available model versions.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • query_params (dict) – Optional url query parameters to use in network transaction

Returns

The JSON dict holding the model inference statistics.

Return type

dict

Raises

InferenceServerException – If unable to get the model inference statistics.

get_model_config(model_name, model_version='', headers=None, query_params=None)

Contact the inference server and get the configuration for specified model.

Parameters
  • model_name (str) – The name of the model

  • model_version (str) – The version of the model to get configuration. The default value is an empty string which means the server will choose a version based on the model and internal policy.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request

  • query_params (dict) – Optional url query parameters to use in network transaction

Returns

The JSON dict holding the model config.

Return type

dict

Raises

InferenceServerException – If unable to get model configuration.

get_model_metadata(model_name, model_version='', headers=None, query_params=None)

Contact the inference server and get the metadata for specified model.

Parameters
  • model_name (str) – The name of the model

  • model_version (str) – The version of the model to get metadata. The default value is an empty string which means the server will choose a version based on the model and internal policy.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request

  • query_params (dict) – Optional url query parameters to use in network transaction

Returns

The JSON dict holding the metadata.

Return type

dict

Raises

InferenceServerException – If unable to get model metadata.

get_model_repository_index(headers=None, query_params=None)

Get the index of model repository contents

Parameters
  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request

  • query_params (dict) – Optional url query parameters to use in network transaction

Returns

The JSON dict holding the model repository index.

Return type

dict

Raises

InferenceServerException – If unable to get the repository index.

get_server_metadata(headers=None, query_params=None)

Contact the inference server and get its metadata.

Parameters
  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • query_params (dict) – Optional url query parameters to use in network transaction.

Returns

The JSON dict holding the metadata.

Return type

dict

Raises

InferenceServerException – If unable to get server metadata.

get_system_shared_memory_status(region_name='', headers=None, query_params=None)

Request system shared memory status from the server.

Parameters
  • region_name (str) – The name of the region to query status. The default value is an empty string, which means that the status of all active system shared memory will be returned.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request

  • query_params (dict) – Optional url query parameters to use in network transaction

Returns

The JSON dict holding system shared memory status.

Return type

dict

Raises

InferenceServerException – If unable to get the status of specified shared memory.

infer(model_name, inputs, model_version='', outputs=None, request_id='', sequence_id=0, sequence_start=False, sequence_end=False, priority=0, timeout=None, headers=None, query_params=None)

Run synchronous inference using the supplied ‘inputs’ requesting the outputs specified by ‘outputs’.

Parameters
  • model_name (str) – The name of the model to run inference.

  • inputs (list) – A list of InferInput objects, each describing data for an input tensor required by the model.

  • model_version (str) – The version of the model to run inference. The default value is an empty string which means the server will choose a version based on the model and internal policy.

  • outputs (list) – A list of InferRequestedOutput objects, each describing how the output data must be returned. If not specified all outputs produced by the model will be returned using default settings.

  • request_id (str) – Optional identifier for the request. If specified will be returned in the response. Default value is an empty string which means no request_id will be used.

  • sequence_id (int) – The unique identifier for the sequence being represented by the object. Default value is 0 which means that the request does not belong to a sequence.

  • sequence_start (bool) – Indicates whether the request being added marks the start of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • sequence_end (bool) – Indicates whether the request being added marks the end of the sequence. Default value is False. This argument is ignored if ‘sequence_id’ is 0.

  • priority (int) – Indicates the priority of the request. Priority value zero indicates that the default priority level should be used (i.e. same behavior as not specifying the priority parameter). Lower value priorities indicate higher priority levels. Thus the highest priority level is indicated by setting the parameter to 1, the next highest is 2, etc. If not provided, the server will handle the request using default setting for the model.

  • timeout (int) – The timeout value for the request, in microseconds. If the request cannot be completed within the time the server can take a model-specific action such as terminating the request. If not provided, the server will handle the request using default setting for the model.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • query_params (dict) – Optional url query parameters to use in network transaction.

Returns

The object holding the result of the inference.

Return type

InferResult

Raises

InferenceServerException – If server fails to perform inference.
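
A minimal sketch of a synchronous HTTP request, assuming a server at localhost:8000 and a hypothetical model "simple" with an FP32 input "INPUT0" of shape [1, 16] and an output "OUTPUT0".

    import numpy as np
    import tritonhttpclient

    client = tritonhttpclient.InferenceServerClient("localhost:8000")

    inputs = [tritonhttpclient.InferInput("INPUT0", [1, 16], "FP32")]
    # binary_data=True (the default) sends the tensor after the JSON body.
    inputs[0].set_data_from_numpy(np.ones((1, 16), dtype=np.float32), binary_data=True)

    outputs = [tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True)]

    result = client.infer("simple", inputs, outputs=outputs)
    print(result.get_response())      # full response as a dict
    print(result.as_numpy("OUTPUT0"))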

is_model_ready(model_name, model_version='', headers=None, query_params=None)

Contact the inference server and get the readiness of specified model.

Parameters
  • model_name (str) – The name of the model to check for readiness.

  • model_version (str) – The version of the model to check for readiness. The default value is an empty string which means the server will choose a version based on the model and internal policy.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • query_params (dict) – Optional url query parameters to use in network transaction.

Returns

True if the model is ready, False if not ready.

Return type

bool

Raises

Exception – If unable to get model readiness.

is_server_live(headers=None, query_params=None)

Contact the inference server and get liveness.

Parameters
  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • query_params (dict) – Optional url query parameters to use in network transaction.

Returns

True if server is live, False if server is not live.

Return type

bool

Raises

Exception – If unable to get liveness.

is_server_ready(headers=None, query_params=None)

Contact the inference server and get readiness.

Parameters
  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request.

  • query_params (dict) – Optional url query parameters to use in network transaction.

Returns

True if server is ready, False if server is not ready.

Return type

bool

Raises

Exception – If unable to get readiness.

load_model(model_name, headers=None, query_params=None)

Request the inference server to load or reload specified model.

Parameters
  • model_name (str) – The name of the model to be loaded.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request

  • query_params (dict) – Optional url query parameters to use in network transaction

Raises

InferenceServerException – If unable to load the model.

register_cuda_shared_memory(name, raw_handle, device_id, byte_size, headers=None, query_params=None)

Request the server to register a CUDA shared memory region with the following specification.

Parameters
  • name (str) – The name of the region to register.

  • raw_handle (bytes) – The raw serialized cudaIPC handle in base64 encoding.

  • device_id (int) – The GPU device ID on which the cudaIPC handle was created.

  • byte_size (int) – The size of the cuda shared memory region, in bytes.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request

  • query_params (dict) – Optional url query parameters to use in network transaction

Raises

InferenceServerException – If unable to register the specified cuda shared memory.

register_system_shared_memory(name, key, byte_size, offset=0, headers=None, query_params=None)

Request the server to register a system shared memory region with the following specification.

Parameters
  • name (str) – The name of the region to register.

  • key (str) – The key of the underlying memory object that contains the system shared memory region.

  • byte_size (int) – The size of the system shared memory region, in bytes.

  • offset (int) – Offset, in bytes, within the underlying memory object to the start of the system shared memory region. The default value is zero.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request

  • query_params (dict) – Optional url query parameters to use in network transaction

Raises

InferenceServerException – If unable to register the specified system shared memory.

unload_model(model_name, headers=None, query_params=None)

Request the inference server to unload specified model.

Parameters
  • model_name (str) – The name of the model to be unloaded.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request

  • query_params (dict) – Optional url query parameters to use in network transaction

Raises

InferenceServerException – If unable to unload the model.

unregister_cuda_shared_memory(name='', headers=None, query_params=None)

Request the server to unregister a cuda shared memory with the specified name.

Parameters
  • name (str) – The name of the region to unregister. The default value is empty string which means all the cuda shared memory regions will be unregistered.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request

  • query_params (dict) – Optional url query parameters to use in network transaction

Raises

InferenceServerException – If unable to unregister the specified cuda shared memory region.

unregister_system_shared_memory(name='', headers=None, query_params=None)

Request the server to unregister a system shared memory with the specified name.

Parameters
  • name (str) – The name of the region to unregister. The default value is empty string which means all the system shared memory regions will be unregistered.

  • headers (dict) – Optional dictionary specifying additional HTTP headers to include in the request

  • query_params (dict) – Optional url query parameters to use in network transaction

Raises

InferenceServerException – If unable to unregister the specified system shared memory region.

Client Utilities

This module includes common utilities used by both GRPC and HTTP/REST clients.

tritonclientutils.raise_error(msg)

Raise error with the provided message

exception tritonclientutils.InferenceServerException(msg, status=None, debug_details=None)

Exception indicating non-Success status.

Parameters
  • msg (str) – A brief description of error

  • status (str) – The error code

  • debug_details (str) – The additional details on the error

debug_details()

Get the detailed information about the exception for debugging purposes

Returns

Returns the exception details

Return type

str

message()

Get the exception message.

Returns

The message associated with this exception, or None if no message.

Return type

str

status()

Get the status of the exception.

Returns

Returns the status of the exception

Return type

str
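
A minimal sketch of catching and inspecting the exception, assuming a gRPC client pointed at localhost:8001 and a model name that may not exist.

    import tritongrpcclient
    from tritonclientutils import InferenceServerException

    client = tritongrpcclient.InferenceServerClient("localhost:8001")

    try:
        client.get_model_metadata("no_such_model")
    except InferenceServerException as e:
        print("status:", e.status())
        print("message:", e.message())
        print("details:", e.debug_details())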

tritonclientutils.serialize_byte_tensor(input_tensor)

Serializes a bytes tensor into a flat numpy array of length-prepended bytes. The input can be a numpy array of bytes with dtype of np.bytes_, numpy strings with dtype of np.str_, or python strings with dtype of np.object.

Parameters

input_tensor (np.array) – The bytes tensor to serialize.

Returns

serialized_bytes_tensor – The 1-D numpy array of type uint8 containing the serialized bytes in ‘C’ order.

Return type

np.array

Raises

InferenceServerException – If unable to serialize the given tensor.

tritonclientutils.deserialize_bytes_tensor(encoded_tensor)

Deserializes an encoded bytes tensor into a numpy array with dtype of python objects.

Parameters

encoded_tensor (bytes) – The encoded bytes tensor where each element has its length in first 4 bytes followed by the content

Returns

string_tensor – The 1-D numpy array of type object containing the deserialized bytes in ‘C’ order.

Return type

np.array
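
A minimal sketch of round-tripping a BYTES tensor through the two helpers; the example strings are arbitrary.

    import numpy as np
    from tritonclientutils import serialize_byte_tensor, deserialize_bytes_tensor

    strings = np.array(["hello", "triton"], dtype=np.object_)

    # 1-D uint8 array of length-prepended bytes.
    serialized = serialize_byte_tensor(strings)

    # Recover a 1-D array of python bytes objects.
    recovered = deserialize_bytes_tensor(serialized.tobytes())
    print(recovered)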

Shared Memory Utilities

This module contains an example API for accessing system and CUDA shared memory for use with Triton.

exception tritonshmutils.shared_memory.SharedMemoryException(err)

Exception indicating non-Success status.

Parameters

err (c_void_p) – Pointer to an Error that should be used to initialize the exception.

tritonshmutils.shared_memory.create_shared_memory_region(triton_shm_name, shm_key, byte_size)

Creates a shared memory region with the specified name and size.

Parameters
  • triton_shm_name (str) – The unique name of the shared memory region to be created.

  • shm_key (str) – The unique key of the shared memory object.

  • byte_size (int) – The size in bytes of the shared memory region to be created.

Returns

shm_handle – The handle for the shared memory region.

Return type

c_void_p

Raises

SharedMemoryException – If unable to create the shared memory region.

tritonshmutils.shared_memory.destroy_shared_memory_region(shm_handle)

Unlink a shared memory region with the specified handle.

Parameters

shm_handle (c_void_p) – The handle for the shared memory region.

Raises

SharedMemoryException – If unable to unlink the shared memory region.

tritonshmutils.shared_memory.get_contents_as_numpy(shm_handle, datatype, shape)

Generates a numpy array using the data stored in the shared memory region specified with the handle.

Parameters
  • shm_handle (c_void_p) – The handle for the shared memory region.

  • datatype (np.dtype) – The datatype of the array to be returned.

  • shape (list) – The list of int describing the shape of the array to be returned.

Returns

The numpy array generated using contents from the specified shared memory region.

Return type

np.array

tritonshmutils.shared_memory.set_shared_memory_region(shm_handle, input_values)

Copy the contents of the numpy array into a shared memory region.

Parameters
  • shm_handle (c_void_p) – The handle for the shared memory region.

  • input_values (list) – The list of numpy arrays to be copied into the shared memory region.

Raises

SharedMemoryException – If unable to mmap or set values in the shared memory region.
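
A minimal sketch of the full system shared memory workflow with the gRPC client, assuming a server at localhost:8001 and a hypothetical model "simple" with an FP32 input "INPUT0" of shape [1, 16] and an output "OUTPUT0"; the region and key names are arbitrary.

    import numpy as np
    import tritongrpcclient
    import tritonshmutils.shared_memory as shm

    input_data = np.ones((1, 16), dtype=np.float32)
    byte_size = input_data.size * input_data.itemsize

    # Create the region and copy the tensor into it.
    shm_handle = shm.create_shared_memory_region("input0_region", "/input0_key", byte_size)
    shm.set_shared_memory_region(shm_handle, [input_data])

    # Tell the server about the region, then reference it from the input.
    client = tritongrpcclient.InferenceServerClient("localhost:8001")
    client.register_system_shared_memory("input0_region", "/input0_key", byte_size)

    input0 = tritongrpcclient.InferInput("INPUT0", [1, 16], "FP32")
    input0.set_shared_memory("input0_region", byte_size)

    result = client.infer("simple", [input0])
    print(result.as_numpy("OUTPUT0"))

    # Clean up when finished.
    client.unregister_system_shared_memory("input0_region")
    shm.destroy_shared_memory_region(shm_handle)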

exception tritonshmutils.cuda_shared_memory.CudaSharedMemoryException(err)

Exception indicating non-Success status.

Parameters

err (c_void_p) – Pointer to an Error that should be used to initialize the exception.

tritonshmutils.cuda_shared_memory.create_shared_memory_region(triton_shm_name, byte_size, device_id)

Creates a shared memory region with the specified name and size.

Parameters
  • triton_shm_name (str) – The unique name of the cuda shared memory region to be created.

  • byte_size (int) – The size in bytes of the cuda shared memory region to be created.

  • device_id (int) – The GPU device ID of the cuda shared memory region to be created.

Returns

cuda_shm_handle – The handle for the cuda shared memory region.

Return type

c_void_p

Raises

CudaSharedMemoryException – If unable to create the cuda shared memory region on the specified device.

tritonshmutils.cuda_shared_memory.destroy_shared_memory_region(cuda_shm_handle)

Close a cuda shared memory region with the specified handle.

Parameters

cuda_shm_handle (c_void_p) – The handle for the cuda shared memory region.

Raises

CudaSharedMemoryException – If unable to close the cuda_shm_handle shared memory region and free the device memory.

tritonshmutils.cuda_shared_memory.get_contents_as_numpy(cuda_shm_handle, datatype, shape)

Generates a numpy array using the data stored in the shared memory region specified with the handle.

Parameters
  • cuda_shm_handle (c_void_p) – The handle for the cuda shared memory region.

  • datatype (np.dtype) – The datatype of the array to be returned.

  • shape (list) – The list of int describing the shape of the array to be returned.

Returns

The numpy array generated using contents from the specified shared memory region.

Return type

np.array

tritonshmutils.cuda_shared_memory.get_raw_handle(cuda_shm_handle)

Returns the underlying raw serialized cudaIPC handle in base64 encoding.

Parameters

cuda_shm_handle (c_void_p) – The handle for the cuda shared memory region.

Returns

The raw serialized cudaIPC handle of underlying cuda shared memory in base64 encoding

Return type

bytes

tritonshmutils.cuda_shared_memory.set_shared_memory_region(cuda_shm_handle, input_values)

Copy the contents of the numpy array into the cuda shared memory region.

Parameters
  • cuda_shm_handle (c_void_p) – The handle for the cuda shared memory region.

  • input_values (list) – The list of numpy arrays to be copied into the shared memory region.

Raises

CudaSharedMemoryException – If unable to set values in the cuda shared memory region.
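
A minimal sketch of creating and registering a CUDA shared memory region over gRPC, assuming a server at localhost:8001 and GPU device 0; the region name is arbitrary.

    import numpy as np
    import tritongrpcclient
    import tritonshmutils.cuda_shared_memory as cudashm

    input_data = np.ones((1, 16), dtype=np.float32)
    byte_size = input_data.size * input_data.itemsize

    # Allocate device memory and copy the tensor into it.
    cuda_handle = cudashm.create_shared_memory_region("input0_cuda", byte_size, 0)
    cudashm.set_shared_memory_region(cuda_handle, [input_data])

    # Register the region with the server using its raw cudaIPC handle.
    client = tritongrpcclient.InferenceServerClient("localhost:8001")
    client.register_cuda_shared_memory(
        "input0_cuda", cudashm.get_raw_handle(cuda_handle), 0, byte_size)

    # The region can now be referenced from InferInput.set_shared_memory() or
    # InferRequestedOutput.set_shared_memory(), just as with system shared memory.

    # Clean up when finished.
    client.unregister_cuda_shared_memory("input0_cuda")
    cudashm.destroy_shared_memory_region(cuda_handle)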