Python API¶
Client¶
-
class
tensorrtserver.api.
InferContext
(url, protocol, model_name, model_version=None, verbose=False, correlation_id=0, streaming=False, http_headers=[])¶ An InferContext object is used to run inference on an inference server for a specific model.
Once created, an InferContext object can be used repeatedly to perform inference using the model.
- Parameters
url (str) – The inference server URL, e.g. localhost:8000.
protocol (ProtocolType) – The protocol used to communicate with the server.
model_name (str) – The name of the model to use for inference.
model_version (int) – The version of the model to use for inference, or None to indicate that the latest (i.e. highest version number) version should be used.
verbose (bool) – If True, generate verbose output.
correlation_id (int) – The correlation ID for the inference. If not specified (or if specified as 0), the inference will have no correlation ID.
streaming (bool) – If True, create a streaming context. Streaming is only allowed with the GRPC protocol.
http_headers (list of strings) – HTTP headers to send with request. Ignored for GRPC protocol. Each header must be specified as “Header:Value”.
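The “Header:Value” convention for http_headers can be produced from an ordinary dictionary. The following is a minimal sketch, not part of tensorrtserver.api: format_http_headers and make_infer_context are hypothetical helper names, and the client import is deferred so that the formatting helper works without the client package installed.

```python
def format_http_headers(headers):
    """Turn a {name: value} mapping into the "Header:Value" strings
    that the http_headers parameter expects."""
    formatted = []
    for name, value in headers.items():
        if ":" in name:
            raise ValueError("header name must not contain ':': %r" % (name,))
        formatted.append("%s:%s" % (name.strip(), str(value).strip()))
    return formatted


def make_infer_context(url, model_name, headers=None, model_version=None):
    """Hypothetical convenience wrapper that builds an HTTP InferContext."""
    # Deferred import: the tensorrtserver client package is only needed
    # when this function is actually called.
    from tensorrtserver.api import InferContext, ProtocolType

    return InferContext(
        url,
        ProtocolType.HTTP,
        model_name,
        model_version=model_version,
        http_headers=format_http_headers(headers or {}),
    )


print(format_http_headers({"Authorization": "Bearer token", "X-Trace": 7}))
```

Because headers are ignored for the GRPC protocol, the wrapper above only makes sense for HTTP contexts.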
-
class
ResultFormat
¶ Formats for output tensor results.
- RAW
All values of the output are returned as a numpy array of the appropriate type.
- CLASS
Specified as a tuple (CLASS, k). The top ‘k’ results are returned as an array of (index, value, label) tuples.
-
async_run
(inputs, outputs, batch_size=1, flags=0)¶ DEPRECATED: This function is deprecated and will be removed in a future version of this API. Instead use async_run_with_cb().
Run inference using the supplied ‘inputs’ to calculate the outputs specified by ‘outputs’.
Unlike run(), async_run() returns immediately after sending the inference request to the server. The returned integer identifier must be used subsequently to wait on and retrieve the actual inference results.
- Parameters
inputs (dict) – Dictionary from input name to the value(s) for that input. An input value is specified as a numpy array. Each input in the dictionary maps to a list of values (i.e. a list of numpy array objects), where the length of the list must equal the ‘batch_size’.
outputs (dict) – Dictionary from output name to a value indicating the ResultFormat that should be used for that output. For RAW the value should be ResultFormat.RAW. For CLASS the value should be a tuple (ResultFormat.CLASS, k), where ‘k’ indicates how many classification results should be returned for the output.
batch_size (int) – The batch size of the inference. Each input must provide an appropriately sized batch of inputs.
flags (int) – The flags to use for the inference. The bitwise-or of InferRequestHeader.Flag values.
- Returns
Integer identifier which must be passed to get_async_run_results() to wait on and retrieve the inference results.
- Return type
int
- Raises
InferenceServerException – If not all inputs are specified, if the size of input data does not match expectations, if unknown output names are specified, or if the server fails to perform inference.
-
async_run_with_cb
(callback, inputs, outputs, batch_size=1, flags=0)¶ Run inference using the supplied ‘inputs’ to calculate the outputs specified by ‘outputs’.
Similar to async_run() above. However, this function does not return the integer identifier. Instead, once the request is completed, the InferContext object and the integer identifier are passed to the provided ‘callback’ function. The caller may either retrieve the results inside the callback function or defer retrieval to a different thread so that the InferContext is not blocked.
- Parameters
callback (function) – Python function that accepts two arguments: the InferContext object that sent the request and an integer identifier. This function is invoked once the request is completed.
inputs (dict) – Dictionary from input name to the value(s) for that input. An input value is specified as a numpy array. Each input in the dictionary maps to a list of values (i.e. a list of numpy array objects), where the length of the list must equal the ‘batch_size’.
outputs (dict) – Dictionary from output name to a value indicating the ResultFormat that should be used for that output. For RAW the value should be ResultFormat.RAW. For CLASS the value should be a tuple (ResultFormat.CLASS, k), where ‘k’ indicates how many classification results should be returned for the output.
batch_size (int) – The batch size of the inference. Each input must provide an appropriately sized batch of inputs.
flags (int) – The flags to use for the inference. The bitwise-or of InferRequestHeader.Flag values.
- Returns
No identifier is returned. Once the request completes, the integer identifier is passed to ‘callback’ together with the InferContext object, and can then be passed to get_async_run_results() to retrieve the inference results.
- Raises
InferenceServerException – If not all inputs are specified, if the size of input data does not match expectations, if unknown output names are specified, or if the server fails to perform inference.
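The option of deferring result retrieval to another thread can be sketched with a standard queue. This is an illustration under assumptions, not the library's prescribed pattern: submit_async and collect_one are hypothetical helper names, and ctx is assumed to be an InferContext (anything exposing async_run_with_cb() and get_async_run_results() as documented here).

```python
import queue


def submit_async(ctx, inputs, outputs, batch_size, completed):
    """Send one request; the callback only records which request finished."""

    def on_complete(infer_ctx, request_id):
        # Keep the callback short so the InferContext is not blocked.
        completed.put((infer_ctx, request_id))

    ctx.async_run_with_cb(on_complete, inputs, outputs, batch_size=batch_size)


def collect_one(completed, timeout=None):
    """Fetch the results of one finished request, e.g. from a consumer thread.

    Raises queue.Empty if nothing completes within 'timeout' seconds.
    """
    infer_ctx, request_id = completed.get(timeout=timeout)
    # wait=True is safe here: the callback fired, so the request is complete.
    return infer_ctx.get_async_run_results(request_id, True)
```

A caller would create one queue.Queue(), call submit_async() for each request, and call collect_one() once per submitted request.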
-
close
()¶ Close the context. Any future calls to the object will result in an Error.
-
correlation_id
()¶ Get the correlation ID associated with the context.
- Returns
The correlation ID.
- Return type
int
-
get_async_run_results
(request_id, wait)¶ Retrieve the results of a previous async_run() or async_run_with_cb() request using the supplied ‘request_id’.
- Parameters
request_id (int) – The integer ID of the asynchronous request, as returned by async_run() or passed to the callback given to async_run_with_cb().
wait (bool) – If True, block until the request results are ready. If False, return immediately even if results are not ready.
- Returns
None if the results are not ready and ‘wait’ is False. Otherwise, a dictionary from output name to the list of values for that output (one list element for each entry of the batch). The format of a value returned for an output depends on the output format specified in ‘outputs’. For format RAW a value is a numpy array of the appropriate type and shape for the output. For format CLASS a value is the top ‘k’ output values returned as an array of (class index, class value, class label) tuples.
- Return type
dict
- Raises
InferenceServerException – If the request ID supplied is not valid, or if the server fails to perform inference.
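Because get_async_run_results() returns None when ‘wait’ is False and the results are not ready, a caller can poll instead of blocking. A minimal sketch, assuming ctx is an InferContext; poll_async_results and its interval_s and max_attempts parameters are hypothetical and not part of the API:

```python
import time


def poll_async_results(ctx, request_id, interval_s=0.01, max_attempts=100):
    """Poll with wait=False until results arrive or attempts run out.

    Returns the results dictionary, or None if the request is still
    pending after max_attempts polls.
    """
    for _ in range(max_attempts):
        results = ctx.get_async_run_results(request_id, False)
        if results is not None:
            return results
        time.sleep(interval_s)  # the caller could do other work here instead
    return None
```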
-
get_last_request_id
()¶ Get the request ID of the most recent run() request.
- Returns
The request ID, or None if a request has not yet been made or if the last request was not successful.
- Return type
int
-
get_last_request_model_name
()¶ Get the model name used in the most recent run() request.
- Returns
The model name, or None if a request has not yet been made or if the last request was not successful.
- Return type
str
-
get_last_request_model_version
()¶ Get the model version used in the most recent run() request.
- Returns
The model version, or None if a request has not yet been made or if the last request was not successful.
- Return type
int
-
get_ready_async_request
(wait)¶ DEPRECATED: This function is deprecated and will be removed in a future version of this API. This function is only useful with the deprecated version of async_run(). Instead use async_run_with_cb().
Get the request ID of an async_run() request that has completed but not yet had results read with get_async_run_results().
- Parameters
wait (bool) – If True, block until an async request is ready. If False, return immediately even if results are not ready.
- Returns
None if no asynchronous results are ready and ‘wait’ is False. Otherwise, an integer identifier which must be passed to get_async_run_results() to wait on and retrieve the inference results.
- Return type
int
- Raises
InferenceServerException – If no asynchronous request is in flight or completed.
-
run
(inputs, outputs, batch_size=1, flags=0)¶ Run inference using the supplied ‘inputs’ to calculate the outputs specified by ‘outputs’.
- Parameters
inputs (dict) – Dictionary from input name to the value(s) for that input. An input value is specified as a numpy array. Each input in the dictionary maps to a list of values (i.e. a list of numpy array objects), where the length of the list must equal the ‘batch_size’.
outputs (dict) – Dictionary from output name to a value indicating the ResultFormat that should be used for that output. For RAW the value should be ResultFormat.RAW. For CLASS the value should be a tuple (ResultFormat.CLASS, k), where ‘k’ indicates how many classification results should be returned for the output.
batch_size (int) – The batch size of the inference. Each input must provide an appropriately sized batch of inputs.
flags (int) – The flags to use for the inference. The bitwise-or of InferRequestHeader.Flag values.
- Returns
A dictionary from output name to the list of values for that output (one list element for each entry of the batch). The format of a value returned for an output depends on the output format specified in ‘outputs’. For format RAW a value is a numpy array of the appropriate type and shape for the output. For format CLASS a value is the top ‘k’ output values returned as an array of (class index, class value, class label) tuples.
- Return type
dict
- Raises
InferenceServerException – If not all inputs are specified, if the size of input data does not match expectations, if unknown output names are specified, or if the server fails to perform inference.
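The shapes of ‘inputs’, ‘outputs’ and the returned dictionary for a CLASS request can be sketched as follows. This is an illustration only: top_labels is a hypothetical helper, the input and output names are supplied by the caller, and ResultFormat is assumed to be reachable as ctx.ResultFormat because it is documented as a class nested in InferContext. With a real context each sample must be a numpy array.

```python
def top_labels(ctx, input_name, output_name, samples, k=3):
    """Run one batch and return the best class label for each sample."""
    samples = list(samples)
    results = ctx.run(
        {input_name: samples},                       # one value per batch entry
        {output_name: (ctx.ResultFormat.CLASS, k)},  # ask for the top-k classes
        batch_size=len(samples),
    )
    # results[output_name] has one element per batch entry; each element
    # holds (class index, class value, class label) tuples.
    # Assumption: the best class is listed first.
    return [per_sample[0][2] for per_sample in results[output_name]]
```

For a RAW output the value in ‘outputs’ would be ctx.ResultFormat.RAW instead of a tuple, and each returned list element would be a numpy array.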
-
exception
tensorrtserver.api.
InferenceServerException
(err)¶ Exception indicating non-Success status.
- Parameters
err (c_void_p) – Pointer to an Error that should be used to initialize the exception.
-
message
()¶ Get the exception message.
- Returns
The message associated with this exception, or None if no message.
- Return type
str
-
request_id
()¶ Get the ID of the request with this exception.
- Returns
The ID of the request associated with this exception, or 0 (zero) if no request is associated.
- Return type
int
-
server_id
()¶ Get the ID of the server associated with this exception.
- Returns
The ID of the server associated with this exception, or None if no server is associated.
- Return type
str
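The three accessors above can be combined into a single log line. A minimal sketch, assuming exc is an InferenceServerException (or any object with the same message(), request_id() and server_id() methods); describe_error is a hypothetical helper name:

```python
def describe_error(exc):
    """Build a one-line summary from an InferenceServerException-like object."""
    parts = [exc.message() or "no message"]  # message() may return None
    request_id = exc.request_id()
    if request_id:  # 0 means no request is associated
        parts.append("request_id=%d" % request_id)
    server_id = exc.server_id()
    if server_id is not None:  # None means no server is associated
        parts.append("server_id=%s" % server_id)
    return ", ".join(parts)
```

Typical use would be inside an `except InferenceServerException as exc:` block around a call such as run().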
-
class
tensorrtserver.api.
ModelControlContext
(url, protocol, verbose=False, http_headers=[])¶ Performs a model control request to an inference server.
- Parameters
url (str) – The inference server URL, e.g. localhost:8000.
protocol (ProtocolType) – The protocol used to communicate with the server.
verbose (bool) – If True, generate verbose output.
http_headers (list of strings) – HTTP headers to send with request. Ignored for GRPC protocol. Each header must be specified as “Header:Value”.
-
close
()¶ Close the context. Any future calls to load() or unload() will result in an Error.
-
get_last_request_id
()¶ Get the request ID of the most recent load() or unload() request.
- Returns
The request ID, or None if a request has not yet been made or if the last request was not successful.
- Return type
int
-
load
(model_name)¶ Request the inference server to load the specified model.
- Parameters
model_name (str) – The name of the model to be loaded.
- Raises
InferenceServerException – If unable to load the model.
-
unload
(model_name)¶ Request the inference server to unload the specified model.
- Parameters
model_name (str) – The name of the model to be unloaded.
- Raises
InferenceServerException – If unable to unload the model.
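Since load() and unload() both raise on failure, their order determines what happens when a swap goes wrong. The sketch below is one possible ordering, not a recommendation from the library: replace_model is a hypothetical helper and control_ctx is assumed to be a ModelControlContext.

```python
def replace_model(control_ctx, old_model, new_model):
    """Load new_model first, then unload old_model.

    If load() raises, unload() is never reached, so the old model is
    left untouched.
    """
    control_ctx.load(new_model)
    control_ctx.unload(old_model)
```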
-
class
tensorrtserver.api.
ProtocolType
¶ Protocol types supported by the client API.
- HTTP
The HTTP protocol.
- GRPC
The GRPC protocol.
-
class
tensorrtserver.api.
ServerHealthContext
(url, protocol, verbose=False, http_headers=[])¶ Performs a health request to an inference server.
- Parameters
url (str) – The inference server URL, e.g. localhost:8000.
protocol (ProtocolType) – The protocol used to communicate with the server.
verbose (bool) – If True, generate verbose output.
http_headers (list of strings) – HTTP headers to send with request. Ignored for GRPC protocol. Each header must be specified as “Header:Value”.
-
close
()¶ Close the context. Any future calls to is_ready() or is_live() will result in an Error.
-
get_last_request_id
()¶ Get the request ID of the most recent is_ready() or is_live() request.
- Returns
The request ID, or None if a request has not yet been made or if the last request was not successful.
- Return type
int
-
is_live
()¶ Contact the inference server and get liveness.
- Returns
True if server is live, False if server is not live.
- Return type
bool
- Raises
InferenceServerException – If unable to get liveness.
-
is_ready
()¶ Contact the inference server and get readiness.
- Returns
True if server is ready, False if server is not ready.
- Return type
bool
- Raises
InferenceServerException – If unable to get readiness.
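A common use of is_live() and is_ready() is to wait for a server to come up before sending requests. A minimal sketch, assuming health_ctx is a ServerHealthContext; wait_until_ready and its parameters are hypothetical. By default exceptions propagate; passing retry_on=(InferenceServerException,) would treat a failed health request as "not ready yet".

```python
import time


def wait_until_ready(health_ctx, timeout_s=30.0, interval_s=0.5, retry_on=()):
    """Poll liveness and readiness until both are True or the timeout expires.

    Returns True if the server became ready, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if health_ctx.is_live() and health_ctx.is_ready():
                return True
        except retry_on:
            pass  # treat the listed exception types as "not ready yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```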
-
class
tensorrtserver.api.
ServerStatusContext
(url, protocol, model_name=None, verbose=False, http_headers=[])¶ Performs a status request to an inference server.
A request can be made to get status for the server and all models managed by the server, or to get status for only a single model.
- Parameters
url (str) – The inference server URL, e.g. localhost:8000.
protocol (ProtocolType) – The protocol used to communicate with the server.
model_name (str) – The name of the model to get status for, or None to get status for all models managed by the server.
verbose (bool) – If True, generate verbose output.
http_headers (list of strings) – HTTP headers to send with request. Ignored for GRPC protocol. Each header must be specified as “Header:Value”.
-
close
()¶ Close the context. Any future calls to get_server_status() will result in an Error.
-
get_last_request_id
()¶ Get the request ID of the most recent get_server_status() request.
- Returns
The request ID, or None if a request has not yet been made or if the last request was not successful.
- Return type
int
-
get_server_status
()¶ Contact the inference server and get status.
- Returns
The ServerStatus protobuf containing the status.
- Return type
ServerStatus
- Raises
InferenceServerException – If unable to get status.
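Every context on this page exposes close(), so contextlib.closing from the standard library can guarantee cleanup after a one-off request. A minimal sketch; fetch_status is a hypothetical helper and make_context is assumed to be a zero-argument callable that returns a ServerStatusContext:

```python
from contextlib import closing


def fetch_status(make_context):
    """Create a status context, fetch the status once, and always close it."""
    with closing(make_context()) as status_ctx:
        # close() is called on exit, even if get_server_status() raises.
        return status_ctx.get_server_status()
```

With the client package installed, a call might look like `fetch_status(lambda: ServerStatusContext("localhost:8000", ProtocolType.HTTP))`, where the URL is the example value used in the parameter descriptions above.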
-
class
tensorrtserver.api.
SharedMemoryControlContext
(url, protocol, verbose=False, http_headers=[])¶ Performs a shared memory control request to an inference server.
- Parameters
url (str) – The inference server URL, e.g. localhost:8000.
protocol (ProtocolType) – The protocol used to communicate with the server.
verbose (bool) – If True, generate verbose output.
http_headers (list of strings) – HTTP headers to send with request. Ignored for GRPC protocol. Each header must be specified as “Header:Value”.
-
close
()¶ Close the context. Any future calls to register() or unregister() will result in an Error.
-
get_last_request_id
()¶ Get the request ID of the most recent register() or unregister() request.
- Returns
The request ID, or None if a request has not yet been made or if the last request was not successful.
- Return type
int
-
register
(shm_handle)¶ Request the inference server to register the specified shared memory region.
- Parameters
shm_handle (c_void_p) – The handle for the shared memory region.
- Raises
InferenceServerException – If unable to register the shared memory region.
-
unregister
(shm_handle)¶ Request the inference server to unregister the specified shared memory region.
- Parameters
shm_handle (c_void_p) – The handle for the shared memory region.
- Raises
InferenceServerException – If unable to unregister the shared memory region.
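Pairing register() with unregister() is easy to get wrong when an error occurs in between. The sketch below wraps the pair in a context manager. It is an illustration under assumptions: registered_region is a hypothetical helper, shm_ctx is assumed to be the shared memory control context described above, and creating shm_handle is outside the scope of this page.

```python
from contextlib import contextmanager


@contextmanager
def registered_region(shm_ctx, shm_handle):
    """Register a shared memory region for the duration of a with-block."""
    shm_ctx.register(shm_handle)
    try:
        yield shm_handle
    finally:
        # Runs even if the body raises, so the region is not left registered.
        shm_ctx.unregister(shm_handle)
```

Inference requests that rely on the region would go inside the `with registered_region(shm_ctx, shm_handle):` block.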