Python API

Client

class tensorrtserver.api.InferContext(url, protocol, model_name, model_version=None, verbose=False, correlation_id=0, streaming=False, http_headers=[])

An InferContext object is used to run inference on an inference server for a specific model.

Once created, an InferContext object can be used repeatedly to perform inference using the model.

Parameters
  • url (str) – The inference server URL, e.g. localhost:8000.

  • protocol (ProtocolType) – The protocol used to communicate with the server.

  • model_name (str) – The name of the model to use for inference.

  • model_version (int) – The version of the model to use for inference, or None to indicate that the latest (i.e. highest version number) version should be used.

  • verbose (bool) – If True generate verbose output.

  • correlation_id (int) – The correlation ID for the inference. If not specified (or if specified as 0), the inference will have no correlation ID.

  • streaming (bool) – If True, create a streaming context. Streaming is only allowed with the gRPC protocol.

  • http_headers (list of strings) – HTTP headers to send with request. Ignored for GRPC protocol. Each header must be specified as “Header:Value”.
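
For example, an InferContext can be created over either protocol. This is a minimal sketch; the URL, model name, version and correlation ID below are illustrative only:

    from tensorrtserver.api import InferContext, ProtocolType

    # HTTP context using the latest version of a hypothetical "resnet50" model.
    ctx = InferContext("localhost:8000", ProtocolType.HTTP, "resnet50")

    # gRPC streaming context pinned to version 1 with a correlation ID, as
    # might be used for a sequence model (streaming requires gRPC).
    stream_ctx = InferContext("localhost:8001", ProtocolType.GRPC, "resnet50",
                              model_version=1, correlation_id=42, streaming=True)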

class ResultFormat

Formats for output tensor results.

RAW

All values of the output are returned as a numpy array of the appropriate type.

CLASS

Specified as a tuple (CLASS, k). The top ‘k’ results are returned as an array of (index, value, label) tuples.

async_run(callback, inputs, outputs, batch_size=1, flags=0, corr_id=0)

Run inference using the supplied ‘inputs’ to calculate the outputs specified by ‘outputs’.

Once the request is completed, the InferContext object and an integer request identifier are passed to the provided ‘callback’ function. The caller may either retrieve the results inside the callback or defer retrieval to a different thread so that the callback returns quickly and the InferContext is not blocked.

Parameters
  • callback (function) – Python function that accepts the InferContext object that issued the request and an integer request identifier as arguments. This function will be invoked once the request is completed.

  • inputs (dict) – Dictionary from input name to the value(s) for that input. An input value is specified as a numpy array. Each input in the dictionary maps to a list of values (i.e. a list of numpy array objects), where the length of the list must equal the ‘batch_size’. However, for shape tensor input the list should contain just a single tensor.

  • outputs (dict) – Dictionary from output name to a value indicating the ResultFormat that should be used for that output. For RAW the value should be ResultFormat.RAW. For CLASS the value should be a tuple (ResultFormat.CLASS, k), where ‘k’ indicates how many classification results should be returned for the output.

  • batch_size (int) – The batch size of the inference. Each input must provide an appropriately sized batch of inputs.

  • corr_id (int) – The correlation id of the inference. If non-zero this correlation ID overrides the context’s correlation ID for all subsequent inference requests, else the inference request uses the context’s correlation ID.

  • flags (int) – The flags to use for the inference. The bitwise-or of InferRequestHeader.Flag values.

Raises

InferenceServerException – If all inputs are not specified, if the size of input data does not match expectations, if unknown output names are specified or if server fails to perform inference.

close()

Close the context. Any future calls to this object will result in an Error.

correlation_id()

Get the correlation ID associated with the context.

Returns

The correlation ID.

Return type

int

get_async_run_results(request_id)

Retrieve the results of a previous async_run() using the supplied ‘request_id’.

Parameters

request_id (int) – The integer ID of the asynchronous request exposed in the callback function of the async_run.

Returns

A dictionary from output name to the list of values for that output (one list element for each entry of the batch). The format of a value returned for an output depends on the output format specified in ‘outputs’. For format RAW a value is a numpy array of the appropriate type and shape for the output. For format CLASS a value is the top ‘k’ output values returned as an array of (class index, class value, class label) tuples.

Return type

dict

Raises

InferenceServerException – If the request ID supplied is not valid, or if the server fails to perform inference.
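
The asynchronous flow can be sketched as follows; the model name “simple” and the tensor names “INPUT0” and “OUTPUT0” are hypothetical, and the callback simply hands the request ID back to the caller, which then retrieves the results:

    import numpy as np
    from queue import Queue
    from tensorrtserver.api import InferContext, ProtocolType

    completed = Queue()

    def completion_callback(infer_ctx, request_id):
        # Defer result retrieval so the callback returns quickly.
        completed.put(request_id)

    # Hypothetical model and tensor names, for illustration only.
    ctx = InferContext("localhost:8000", ProtocolType.HTTP, "simple")
    input0 = np.arange(16, dtype=np.int32)

    ctx.async_run(completion_callback,
                  {"INPUT0": [input0]},
                  {"OUTPUT0": InferContext.ResultFormat.RAW},
                  batch_size=1)

    request_id = completed.get()
    results = ctx.get_async_run_results(request_id)
    print(results["OUTPUT0"][0])    # numpy array for the single batch entry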

get_last_request_id()

Get the request ID of the most recent run() request.

Returns

The request ID, or None if a request has not yet been made or if the last request was not successful.

Return type

int

get_last_request_model_name()

Get the model name used in the most recent run() request.

Returns

The model name, or None if a request has not yet been made or if the last request was not successful.

Return type

str

get_last_request_model_version()

Get the model version used in the most recent run() request.

Returns

The model version, or None if a request has not yet been made or if the last request was not successful.

Return type

int

get_stat()

Get the current statistics of the InferContext.

Returns

A dictionary containing the completed_request_count, cumulative_total_request_time_ns, cumulative_send_time_ns and cumulative_receive_time_ns statistics, keyed by those names.

Return type

dict

Raises

InferenceServerException – If unable to retrieve the statistics.
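
For example, assuming ‘ctx’ is an InferContext that has completed some requests, an average request latency can be derived from the statistics (key names as listed above):

    stats = ctx.get_stat()
    count = stats["completed_request_count"]
    if count > 0:
        avg_ms = stats["cumulative_total_request_time_ns"] / count / 1.0e6
        print("average request latency: %.3f ms over %d requests" % (avg_ms, count))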

run(inputs, outputs, batch_size=1, flags=0, corr_id=0)

Run inference using the supplied ‘inputs’ to calculate the outputs specified by ‘outputs’.

Parameters
  • inputs (dict) – Dictionary from input name to the value(s) for that input. An input value is specified as a numpy array. Each input in the dictionary maps to a list of values (i.e. a list of numpy array objects), where the length of the list must equal the ‘batch_size’. However, for shape tensor input the list should contain just a single tensor.

  • outputs (dict) – Dictionary from output name to a value indicating the ResultFormat that should be used for that output. For RAW the value should be ResultFormat.RAW. For CLASS the value should be a tuple (ResultFormat.CLASS, k), where ‘k’ indicates how many classification results should be returned for the output.

  • batch_size (int) – The batch size of the inference. Each input must provide an appropriately sized batch of inputs.

  • flags (int) – The flags to use for the inference. The bitwise-or of InferRequestHeader.Flag values.

  • corr_id (int) – The correlation id of the inference. Used to differentiate sequences.

Returns

A dictionary from output name to the list of values for that output (one list element for each entry of the batch). The format of a value returned for an output depends on the output format specified in ‘outputs’. For format RAW a value is a numpy array of the appropriate type and shape for the output. For format CLASS a value is the top ‘k’ output values returned as an array of (class index, class value, class label) tuples.

Return type

dict

Raises

InferenceServerException – If all inputs are not specified, if the size of input data does not match expectations, if unknown output names are specified or if server fails to perform inference.
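
A synchronous sketch, again with hypothetical model and tensor names, requesting one RAW output and one top-3 CLASS output:

    import numpy as np
    from tensorrtserver.api import InferContext, ProtocolType

    # Hypothetical model and tensor names, for illustration only.
    ctx = InferContext("localhost:8000", ProtocolType.HTTP, "simple")
    input0 = np.arange(16, dtype=np.int32)
    input1 = np.ones(16, dtype=np.int32)

    results = ctx.run(
        {"INPUT0": [input0], "INPUT1": [input1]},           # one array per batch entry
        {"OUTPUT0": InferContext.ResultFormat.RAW,           # full output tensor
         "OUTPUT1": (InferContext.ResultFormat.CLASS, 3)},   # top-3 classification
        batch_size=1)

    raw_output = results["OUTPUT0"][0]   # numpy array
    top3 = results["OUTPUT1"][0]         # list of (index, value, label) tuples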

exception tensorrtserver.api.InferenceServerException(err)

Exception indicating non-Success status.

Parameters

err (c_void_p) – Pointer to an Error that should be used to initialize the exception.

message()

Get the exception message.

Returns

The message associated with this exception, or None if no message.

Return type

str

request_id()

Get the ID of the request with this exception.

Returns

The ID of the request associated with this exception, or 0 (zero) if no request is associated.

Return type

int

server_id()

Get the ID of the server associated with this exception.

Returns

The ID of the server associated with this exception, or None if no server is associated.

Return type

str

class tensorrtserver.api.ModelControlContext(url, protocol, verbose=False, http_headers=[])

Performs a model control request to an inference server.

Parameters
  • url (str) – The inference server URL, e.g. localhost:8000.

  • protocol (ProtocolType) – The protocol used to communicate with the server.

  • verbose (bool) – If True generate verbose output.

  • http_headers (list of strings) – HTTP headers to send with request. Ignored for GRPC protocol. Each header must be specified as “Header:Value”.

close()

Close the context. Any future calls to load() or unload() will result in an Error.

get_last_request_id()

Get the request ID of the most recent load() or unload() request.

Returns

The request ID, or None if a request has not yet been made or if the last request was not successful.

Return type

int

load(model_name)

Request the inference server to load the specified model.

Parameters

model_name (str) – The name of the model to be loaded.

Raises

InferenceServerException – If unable to load the model.

unload(model_name)

Request the inference server to unload the specified model.

Parameters

model_name (str) – The name of the model to be unloaded.

Raises

InferenceServerException – If unable to unload the model.
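
A minimal sketch of explicit model control; the model name is hypothetical and the server must be running in a mode that allows model loading and unloading:

    from tensorrtserver.api import InferenceServerException, ModelControlContext, ProtocolType

    ctrl_ctx = ModelControlContext("localhost:8000", ProtocolType.HTTP)
    try:
        ctrl_ctx.load("resnet50")      # make the model available for inferencing
        ctrl_ctx.unload("resnet50")    # release the model
    except InferenceServerException as ex:
        print("model control failed:", ex.message())
    finally:
        ctrl_ctx.close()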

class tensorrtserver.api.ModelRepositoryContext(url, protocol, verbose=False, http_headers=[])

Performs a model repository request to an inference server.

A request can be made to get the model repository index of the server.

Parameters
  • url (str) – The inference server URL, e.g. localhost:8000.

  • protocol (ProtocolType) – The protocol used to communicate with the server.

  • verbose (bool) – If True generate verbose output.

  • http_headers (list of strings) – HTTP headers to send with request. Ignored for GRPC protocol. Each header must be specified as “Header:Value”.

close()

Close the context. Any future calls to get_model_repository_index() will result in an Error.

get_last_request_id()

Get the request ID of the most recent get_model_repository_index() request.

Returns

The request ID, or None if a request has not yet been made or if the last request was not successful.

Return type

int

get_model_repository_index()

Contact the inference server and get the index of the model repository.

Returns

The ModelRepositoryIndex protobuf containing the index.

Return type

ModelRepositoryIndex

Raises

InferenceServerException – If unable to get index.
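
For example, a sketch that prints the names of the models in the repository; the ‘models’ and ‘name’ protobuf fields are assumed from the server’s model_repository.proto:

    from tensorrtserver.api import ModelRepositoryContext, ProtocolType

    repo_ctx = ModelRepositoryContext("localhost:8000", ProtocolType.HTTP)
    index = repo_ctx.get_model_repository_index()
    for model in index.models:    # assumed repeated 'models' field with a 'name'
        print(model.name)
    repo_ctx.close()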

class tensorrtserver.api.ProtocolType

Protocol types supported by the client API

HTTP

The HTTP protocol.

GRPC

The GRPC protocol.

class tensorrtserver.api.ServerHealthContext(url, protocol, verbose=False, http_headers=[])

Performs a health request to an inference server.

Parameters
  • url (str) – The inference server URL, e.g. localhost:8000.

  • protocol (ProtocolType) – The protocol used to communicate with the server.

  • verbose (bool) – If True generate verbose output.

  • http_headers (list of strings) – HTTP headers to send with request. Ignored for GRPC protocol. Each header must be specified as “Header:Value”.

close()

Close the context. Any future calls to is_ready() or is_live() will result in an Error.

get_last_request_id()

Get the request ID of the most recent is_ready() or is_live() request.

Returns

The request ID, or None if a request has not yet been made or if the last request was not successful.

Return type

int

is_live()

Contact the inference server and get liveness.

Returns

True if server is live, False if server is not live.

Return type

bool

Raises

InferenceServerException – If unable to get liveness.

is_ready()

Contact the inference server and get readiness.

Returns

True if server is ready, False if server is not ready.

Return type

bool

Raises

InferenceServerException – If unable to get readiness.
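
A minimal liveness/readiness sketch; the URL is illustrative:

    from tensorrtserver.api import ProtocolType, ServerHealthContext

    health_ctx = ServerHealthContext("localhost:8000", ProtocolType.HTTP)
    print("live:  " + str(health_ctx.is_live()))
    print("ready: " + str(health_ctx.is_ready()))
    health_ctx.close()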

class tensorrtserver.api.ServerStatusContext(url, protocol, model_name=None, verbose=False, http_headers=[])

Performs a status request to an inference server.

A request can be made to get status for the server and all models managed by the server, or to get status for only a single model.

Parameters
  • url (str) – The inference server URL, e.g. localhost:8000.

  • protocol (ProtocolType) – The protocol used to communicate with the server.

  • model_name (str) – The name of the model to get status for, or None to get status for all models managed by the server.

  • verbose (bool) – If True generate verbose output.

  • http_headers (list of strings) – HTTP headers to send with request. Ignored for GRPC protocol. Each header must be specified as “Header:Value”.

close()

Close the context. Any future calls to get_server_status() will result in an Error.

get_last_request_id()

Get the request ID of the most recent get_server_status() request.

Returns

The request ID, or None if a request has not yet been made or if the last request was not successful.

Return type

int

get_server_status()

Contact the inference server and get status.

Returns

The ServerStatus protobuf containing the status.

Return type

ServerStatus

Raises

InferenceServerException – If unable to get status.
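
For example, a sketch that fetches and prints the status of a single hypothetical model (pass model_name=None, or omit it, to get status for all models):

    from tensorrtserver.api import ProtocolType, ServerStatusContext

    status_ctx = ServerStatusContext("localhost:8000", ProtocolType.HTTP, "resnet50")
    server_status = status_ctx.get_server_status()   # ServerStatus protobuf
    print(server_status)
    status_ctx.close()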

class tensorrtserver.api.SharedMemoryControlContext(url, protocol, verbose=False, http_headers=[])

Performs a shared memory control request to an inference server.

Parameters
  • url (str) – The inference server URL, e.g. localhost:8000.

  • protocol (ProtocolType) – The protocol used to communicate with the server.

  • verbose (bool) – If True generate verbose output.

  • http_headers (list of strings) – HTTP headers to send with request. Ignored for GRPC protocol. Each header must be specified as “Header:Value”.

close()

Close the context. Any future calls to register() or unregister() will result in an Error.

cuda_register(cuda_shm_handle)

Request the inference server to register the specified CUDA shared memory region.

Parameters

cuda_shm_handle (c_void_p) – The handle for the CUDA shared memory region.

Raises

InferenceServerException – If unable to register the shared memory region.

get_last_request_id()

Get the request ID of the most recent register() or unregister() request.

Returns

The request ID, or None if a request has not yet been made or if the last request was not successful.

Return type

int

get_shared_memory_status()

Contact the inference server and get the shared memory status.

Returns

The SharedMemoryStatus protobuf containing the status.

Return type

SharedMemoryStatus

Raises

InferenceServerException – If unable to get status.

register(shm_handle)

Request the inference server to register the specified shared memory region.

Parameters

shm_handle (c_void_p) – The handle for the shared memory region.

Raises

InferenceServerException – If unable to register the shared memory region.

unregister(shm_handle)

Request the inference server to unregister the specified shared memory region.

Parameters

shm_handle (c_void_p) – The handle for the shared memory region.

Raises

InferenceServerException – If unable to unregister the shared memory region.

unregister_all()

Request the inference server to unregister all shared memory regions.

Raises

InferenceServerException – If unable to unregister any shared memory regions.
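
A minimal sketch, assuming the companion tensorrtserver.shared_memory utility module is used to create the system shared memory region; the region name, key and byte size are illustrative:

    # tensorrtserver.shared_memory is the assumed companion utility module.
    import tensorrtserver.shared_memory as shm
    from tensorrtserver.api import ProtocolType, SharedMemoryControlContext

    shm_ctx = SharedMemoryControlContext("localhost:8000", ProtocolType.HTTP)

    # Create a 64-byte system shared memory region and register it with the server.
    shm_handle = shm.create_shared_memory_region("output_data", "/output_simple", 64)
    shm_ctx.register(shm_handle)

    print(shm_ctx.get_shared_memory_status())   # SharedMemoryStatus protobuf

    # Clean up: unregister on the server, then destroy the local region.
    shm_ctx.unregister(shm_handle)
    shm.destroy_shared_memory_region(shm_handle)
    shm_ctx.close()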

tensorrtserver.api.serialize_string_tensor(input_tensor)

Serializes a string tensor into a flat numpy array of length-prepended strings. The string tensor can be passed as a numpy array of bytes with dtype np.bytes_, numpy strings with dtype np.str_, or Python strings with dtype np.object.

Parameters

input_tensor (np.array) – The string tensor to serialize.

Returns

serialized_string_tensor – The 1-D numpy array of type uint8 containing the serialized strings in ‘C’ order.

Return type

np.array

Raises

InferenceServerException – If unable to serialize the given tensor.
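
For example, a sketch that serializes a small string tensor; the resulting uint8 array’s size gives the byte size needed if the tensor is to be placed in a shared memory region:

    import numpy as np
    from tensorrtserver.api import serialize_string_tensor

    in0 = np.array(["hello", "world"], dtype=np.object)
    serialized = serialize_string_tensor(in0)
    print(serialized.dtype, serialized.size)   # uint8, total serialized byte size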