Inference Server API

The TensorRT Inference Server exposes both HTTP and GRPC endpoints. Three endpoints with identical functionality are exposed for each protocol.

  • Health: The server health API for determining server liveness and readiness.

  • Status: The server status API for getting information about the server and about the models being served.

  • Inference: The inference API that accepts model inputs, runs inference and returns the requested outputs.

The inference server also exposes an endpoint based on GRPC streams that is only available when using the GRPC protocol:

  • Stream Inference: The stream inference API is the same as the Inference API, except that once the connection is established, subsequent requests are sent over that same connection until it is closed.

The HTTP endpoints can be used directly as described in this section, but for most use cases, the preferred way to access the inference server is via the C++ and Python Client libraries.

The GRPC endpoints can also be used via the C++ and Python Client libraries or a GRPC-generated API can be used directly as shown in the grpc_image_client.py example.

Health

Performing an HTTP GET to /api/health/live returns a 200 status if the server is able to receive and process requests. Any other status code indicates that the server is still initializing or has failed in some way that prevents it from processing requests.

Once the liveness endpoint indicates that the server is active, performing an HTTP GET to /api/health/ready returns a 200 status if the server is able to respond to inference requests for some or all models (based on the inference server’s --strict-readiness option explained below). Any other status code indicates that the server is not ready to respond to some or all inference requests.

For GRPC the GRPCService uses the HealthRequest and HealthResponse messages to implement the endpoint.

By default, the readiness endpoint will return success if the server is responsive and all models loaded successfully. Thus, by default, success indicates that an inference request for any model can be handled by the server. For some use cases you may want the readiness endpoint to return success even if not all models are available. In this case, use the --strict-readiness=false option to cause the readiness endpoint to report success as long as the server is responsive (even if one or more models are not available).
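
As a minimal sketch of how the health endpoints can be polled, the example below uses Python with the requests library; the server address (localhost:8000) is an assumption and should be adjusted to match the actual deployment.

import requests

SERVER_URL = "http://localhost:8000"  # assumed server address

# Liveness: 200 means the server is able to receive and process requests.
live = requests.get(SERVER_URL + "/api/health/live")
print("live:", live.status_code == 200)

# Readiness: 200 means the server can respond to inference requests for
# some or all models, depending on the --strict-readiness setting.
ready = requests.get(SERVER_URL + "/api/health/ready")
print("ready:", ready.status_code == 200)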

Status

Performing an HTTP GET to /api/status returns status information about the server and all the models being served. Performing an HTTP GET to /api/status/<model name> returns information about the server and the single model specified by <model name>. The server status is returned in the HTTP response body in either text format (the default) or in binary format if query parameter format=binary is specified (for example, /api/status?format=binary). The success or failure of the status request is indicated in the HTTP response code and the NV-Status response header. The NV-Status response header returns a text protobuf formatted RequestStatus message.

For GRPC the GRPCService uses the StatusRequest and StatusResponse messages to implement the endpoint. The response includes a RequestStatus message indicating success or failure.

For either protocol the status itself is returned as a ServerStatus message.
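
As a sketch of using the status endpoint, the following Python example (again using the requests library) fetches the status of a single model in both text and binary format; the server address and the model name resnet50 are assumptions for illustration.

import requests

SERVER_URL = "http://localhost:8000"  # assumed server address
MODEL_NAME = "resnet50"               # hypothetical model name

# Text-format status (the default). Success or failure is reported in the
# HTTP response code and in the NV-Status response header.
resp = requests.get(SERVER_URL + "/api/status/" + MODEL_NAME)
print("HTTP status:", resp.status_code)
print("NV-Status:", resp.headers.get("NV-Status"))
print(resp.text)  # ServerStatus message in text protobuf format

# The same status in binary format via the format=binary query parameter.
resp_bin = requests.get(SERVER_URL + "/api/status/" + MODEL_NAME,
                        params={"format": "binary"})
print(len(resp_bin.content), "bytes of serialized ServerStatus")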

Inference

Performing an HTTP POST to /api/infer/<model name> performs inference using the latest version of the model that is being made available by the model’s version policy. The latest version is the numerically greatest version number. Performing an HTTP POST to /api/infer/<model name>/<model version> performs inference using a specific version of the model.

The request uses the NV-InferRequest header to communicate an InferRequestHeader message that describes the input tensors and the requested output tensors. For example, for a resnet50 model the following NV-InferRequest header indicates that a batch-size 1 request is being made with a single input named “input”, and that the result of the tensor named “output” should be returned as the top-3 classification values:

NV-InferRequest: batch_size: 1 input { name: "input" } output { name: "output" cls { count: 3 } }

The input tensor values are communicated in the body of the HTTP POST request as raw binary, in the order in which the inputs are listed in the request header.
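
The following sketch sends the resnet50 request described above over HTTP using Python's requests and numpy; the server address and the 3x224x224 FP32 input shape are assumptions, and the model configuration determines the actual input shape and datatype.

import numpy as np
import requests

SERVER_URL = "http://localhost:8000"  # assumed server address
MODEL_NAME = "resnet50"               # hypothetical model name

# Assumed input shape and datatype; the raw bytes in the request body must
# match the size the model expects for a batch-size 1 request.
input_data = np.random.rand(3, 224, 224).astype(np.float32)

# InferRequestHeader in text protobuf form, matching the example above:
# batch-size 1, one input named "input", top-3 classification of "output".
infer_header = ('batch_size: 1 '
                'input { name: "input" } '
                'output { name: "output" cls { count: 3 } }')

# POST to /api/infer/<model name> uses the latest model version; append
# /<model version> to the URL to request a specific version instead.
resp = requests.post(SERVER_URL + "/api/infer/" + MODEL_NAME,
                     headers={"NV-InferRequest": infer_header},
                     data=input_data.tobytes())

print("HTTP status:", resp.status_code)
print("NV-Status:", resp.headers.get("NV-Status"))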

The HTTP response includes an NV-InferResponse header that communicates an InferResponseHeader message that describes the outputs. For example, the response to the above request could include the following header:

NV-InferResponse: model_name: "mymodel" model_version: 1 batch_size: 1 output { name: "output" raw { dims: 4 dims: 4 batch_byte_size: 64 } }

This response shows that the output is a tensor with shape [ 4, 4 ] and a size of 64 bytes. The output tensor contents are returned in the body of the HTTP response to the POST request. For outputs where full result tensors were requested, the result values are communicated in the body of the response in the order in which the outputs are listed in the NV-InferResponse header. After those, an InferResponseHeader message is appended to the response body. The InferResponseHeader message is returned in either text format (the default) or in binary format if query parameter format=binary is specified (for example, /api/infer/foo?format=binary).

For example, for an inference request to a model that has ‘n’ outputs, listed in the NV-InferResponse header in order as “output[0]”, …, “output[n-1]”, the response body would contain:

<raw binary tensor values for output[0] >
...
<raw binary tensor values for output[n-1] >
<text or binary encoded InferResponseHeader proto>

The success or failure of the inference request is indicated in the HTTP response code and the NV-Status response header. The NV-Status response header returns a text protobuf formatted RequestStatus message.
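
One way to split such a response body is sketched below in Python; the model name mymodel, its single raw FP32 output, and the input size are assumptions used only to illustrate the layout described above.

import re

import numpy as np
import requests

SERVER_URL = "http://localhost:8000"  # assumed server address

# Hypothetical model with one output requested as a full (raw) result
# tensor, so its values come back as raw bytes at the front of the body.
infer_header = ('batch_size: 1 '
                'input { name: "input" } '
                'output { name: "output" }')
input_data = np.zeros(16, dtype=np.float32)  # assumed input size

resp = requests.post(SERVER_URL + "/api/infer/mymodel",
                     headers={"NV-InferRequest": infer_header},
                     data=input_data.tobytes())

# The NV-InferResponse header carries a text protobuf InferResponseHeader;
# batch_byte_size tells how many raw bytes belong to the output tensor.
header = resp.headers["NV-InferResponse"]
byte_size = int(re.search(r"batch_byte_size:\s*(\d+)", header).group(1))

raw_output = resp.content[:byte_size]  # raw tensor values for output[0]
trailer = resp.content[byte_size:]     # appended InferResponseHeader, text format

print(np.frombuffer(raw_output, dtype=np.float32))  # assumed FP32 output
print(trailer.decode(errors="replace"))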

For GRPC the GRPCService uses the InferRequest and InferResponse messages to implement the endpoint. The response includes a RequestStatus message indicating success or failure, an InferResponseHeader message giving response meta-data, and the raw output tensors.

Stream Inference

Some applications require that multiple inference requests be sent over one persistent connection rather than potentially establishing multiple connections. For instance, when multiple instances of TensorRT Inference Server are created for load balancing, requests sent over different connections may be routed to different server instances. This does not meet the requirement when the requests are correlated and must be processed by the same model instance, such as when inferencing with stateful models. With stream inference, all requests are sent to the same server instance once the connection is established.

For GRPC the GRPCService uses the same InferRequest and InferResponse messages to implement the stream inference endpoint. The response includes a RequestStatus message indicating success or failure, an InferResponseHeader message giving response meta-data, and the raw output tensors.