Optimization

The Triton Inference Server has many features that you can use to decrease latency and increase throughput for your model. This section discusses these features and demonstrates how you can use them to improve the performance of your model. As a prerequisite you should follow the Quickstart to get Triton and client examples running with the example model repository.

Unless you already have a client application suitable for measuring the performance of your model on Triton, you should familiarize yourself with perf_client. The perf_client application is an essential tool for optimizing your model’s performance.

As a running example demonstrating the optimization features and options, we will use a Caffe2 ResNet50 model that you can obtain by following the Quickstart. As a baseline we use perf_client to determine the performance of the model using a basic model configuration that does not enable any performance features:

$ perf_client -m resnet50_netdef --percentile=95 --concurrency-range 1:4
...
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, 159 infer/sec, latency 6701 usec
Concurrency: 2, 204.8 infer/sec, latency 9807 usec
Concurrency: 3, 204.2 infer/sec, latency 14846 usec
Concurrency: 4, 199.6 infer/sec, latency 20499 usec

The results show that our non-optimized model configuration gives a throughput of about 200 inferences per second. Note how there is a significant throughput increase going from one concurrent request to two concurrent requests and then throughput levels off. With one concurrent request Triton is idle during the time when the response is returned to the client and the next request is received at the server. Throughput increases with a concurrency of 2 because Triton overlaps the processing of one request with the communication of the other. Because we are running perf_client on the same system as Triton, 2 requests are enough to completely hide the communication latency.

Optimization Settings

For most models, the Triton feature that provides the largest performance improvement is the Dynamic Batcher. If your model does not support batching then you can skip ahead to Model Instances.
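
Whether a model supports batching is indicated by the max_batch_size setting in its model configuration: a value greater than zero enables batching (assuming the underlying framework model accepts a variable batch dimension), while a value of zero disables it. A batching-capable configuration therefore contains a line like the following (the value shown is illustrative):

max_batch_size: 8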

Dynamic Batcher

The dynamic batcher combines individual inference requests into a larger batch that will often execute much more efficiently than executing the individual requests independently. To enable the dynamic batcher stop Triton, add the following lines to the end of the model configuration file for resnet50_netdef, and then restart Triton:

dynamic_batching { }

The dynamic batcher allows Triton to handle a higher number of concurrent requests because those requests are combined for inference. So run perf_client with request concurrency from 1 to 8:

$ perf_client -m resnet50_netdef --percentile=95 --concurrency-range 1:8
...
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, 154.2 infer/sec, latency 6662 usec
Concurrency: 2, 203.6 infer/sec, latency 9931 usec
Concurrency: 3, 242.4 infer/sec, latency 12421 usec
Concurrency: 4, 335.6 infer/sec, latency 12423 usec
Concurrency: 5, 335.2 infer/sec, latency 16034 usec
Concurrency: 6, 363 infer/sec, latency 19990 usec
Concurrency: 7, 369.6 infer/sec, latency 21382 usec
Concurrency: 8, 426.6 infer/sec, latency 19526 usec

With eight concurrent requests the dynamic batcher allows Triton to provide about 425 inferences per second without increasing latency compared to not using the dynamic batcher.

You can also explicitly specify what batch sizes you would like the dynamic batcher to prefer when creating batches. For example, to indicate that you would like the dynamic batcher to prefer size 4 batches you can modify the model configuration like this (multiple preferred sizes can be given but in this case we just have one):

dynamic_batching { preferred_batch_size: [ 4 ]}
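
Multiple preferred sizes are simply listed together. For example, to prefer batches of size 4 or 8 (the second value is illustrative; the rest of this example keeps the single preferred size of 4):

dynamic_batching { preferred_batch_size: [ 4, 8 ] }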

Instead of having perf_client collect data for a range of request concurrency values, we can use a simple rule that typically applies when perf_client is running on the same system as Triton. The rule is that for maximum throughput set the request concurrency to be 2 * <preferred batch size> * <model instance count>. We will discuss model instances below; for now we are working with one model instance. So for a preferred batch size of 4 we want to run perf_client with a request concurrency of 2 * 4 * 1 = 8:

$ perf_client -m resnet50_netdef --percentile=95 --concurrency-range 8
...
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 8, 420.2 infer/sec, latency 19524 usec

Model Instances

Triton allows you to specify how many copies of each model you want to make available for inferencing. By default you get one copy of each model, but you can specify any number of instances in the model configuration by using Instance Groups. Typically, having two instances of a model will improve performance because it allows overlap of memory transfer operations (for example, CPU to/from GPU) with inference compute. Multiple instances also improve GPU utilization by allowing more inference work to be executed simultaneously on the GPU. Smaller models may benefit from more than two instances; you can use perf_client to experiment.

To specify two instances of the resnet50_netdef model: stop Triton, remove any dynamic batching settings you may have previously added to the model configuration (we discuss combining dynamic batcher and multiple model instances below), add the following lines to the end of the model configuration file, and then restart Triton:

instance_group [ { count: 2 }]

Now run perf_client using the same options as for the baseline:

$ perf_client -m resnet50_netdef --percentile=95 --concurrency-range 1:4
...
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, 129.4 infer/sec, latency 8434 usec
Concurrency: 2, 257.4 infer/sec, latency 8126 usec
Concurrency: 3, 289.6 infer/sec, latency 12621 usec
Concurrency: 4, 287.8 infer/sec, latency 14296 usec

In this case having two instances of the model increases throughput from about 200 inferences per second with one instance to about 290 inferences per second.

It is possible to enable both the dynamic batcher and multiple model instances, for example:

dynamic_batching { preferred_batch_size: [ 4 ] }
instance_group [ { count: 2 }]

When we run perf_client with the same options used for just the dynamic batcher above:

$ perf_client -m resnet50_netdef --percentile=95 --concurrency-range 8
...
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 8, 409.2 infer/sec, latency 24284 usec

We see that adding a second instance does not improve throughput or latency. This occurs because for this model the dynamic batcher alone is capable of fully utilizing the GPU, so adding additional model instances does not provide any performance advantage. In general the benefit of the dynamic batcher and multiple instances is model specific, so you should experiment with perf_client to determine the settings that best satisfy your throughput and latency requirements.

Framework-Specific Optimization

Triton has several optimization settings that apply to only a subset of the supported model frameworks. These optimization settings are controlled by the model configuration optimization policy.

One especially powerful optimization that we will explore here is to use TensorRT Optimization in conjunction with a TensorFlow or ONNX model.

ONNX with TensorRT Optimization

As an example of TensorRT optimization applied to an ONNX model, we will use an ONNX DenseNet model that you can obtain by following the Quickstart. As a baseline we use perf_client to determine the performance of the model using a basic model configuration that does not enable any performance features:

$ perf_client -m densenet_onnx --percentile=95 --concurrency-range 1:4
...
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, 113.2 infer/sec, latency 8939 usec
Concurrency: 2, 138.2 infer/sec, latency 14548 usec
Concurrency: 3, 137.2 infer/sec, latency 21947 usec
Concurrency: 4, 136.8 infer/sec, latency 29661 usec

To enable TensorRT optimization for the model: stop Triton, add the following lines to the end of the model configuration file, and then restart Triton:

optimization { execution_accelerators {
  gpu_execution_accelerator : [ { name : "tensorrt" } ]
}}

As Triton starts you should check the console output and wait until Triton prints the “Starting endpoints” message; ONNX model loading can be significantly slower when TensorRT optimization is enabled. Now run perf_client using the same options as for the baseline:

$ perf_client -m densenet_onnx --percentile=95 --concurrency-range 1:4
...
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, 190.6 infer/sec, latency 5384 usec
Concurrency: 2, 273.8 infer/sec, latency 7347 usec
Concurrency: 3, 272.2 infer/sec, latency 11046 usec
Concurrency: 4, 266.8 infer/sec, latency 15089 usec

The TensorRT optimization provided 2x throughput improvement while cutting latency in half. The benefit provided by TensorRT will vary based on the model, but in general it can provide significant performance improvement.

TensorFlow with TensorRT Optimization

TensorRT optimization applied to a TensorFlow model works similarly to the TensorRT optimization for ONNX models described above. To enable TensorRT optimization you must set the model configuration appropriately. For TensorRT optimization of TensorFlow models there are several options that you can enable, including selection of the compute precision. For example:

optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "precision_mode" value: "FP16" }}]
}}

The options are described in detail in the ModelOptimizationPolicy section of the model configuration protobuf.

As an example of TensorRT optimization applied to a TensorFlow model, we will use a TensorFlow Inception model that you can obtain by following the Quickstart. As a baseline we use perf_client to determine the performance of the model using a basic model configuration that does not enable any performance features:

$ perf_client -m inception_graphdef --percentile=95 --concurrency-range 1:4
...
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, 105.6 infer/sec, latency 12865 usec
Concurrency: 2, 120.6 infer/sec, latency 20888 usec
Concurrency: 3, 122.8 infer/sec, latency 30308 usec
Concurrency: 4, 123.4 infer/sec, latency 39465 usec

To enable TensorRT optimization for the model: stop Triton, add the lines from above to the end of the model configuration file, and then restart Triton. As Triton starts you should check the console output and wait until the server prints the “Starting endpoints” message. Now run perf_client using the same options as for the baseline. Note that the first run of perf_client might time out because the TensorRT optimization is performed when the inference request is received and may take significant time. If this happens, just run perf_client again:

$ perf_client -m inception_graphdef --percentile=95 --concurrency-range 1:4
...
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, 172 infer/sec, latency 6912 usec
Concurrency: 2, 265.2 infer/sec, latency 8905 usec
Concurrency: 3, 254.2 infer/sec, latency 13506 usec
Concurrency: 4, 257 infer/sec, latency 17715 usec

The TensorRT optimization provided 2x throughput improvement while cutting latency in half. The benefit provided by TensorRT will vary based on the model, but in general it can provide significant performance improvement.

perf_client

A critical part of optimizing the inference performance of your model is being able to measure changes in performance as you experiment with different optimization strategies. The perf_client application performs this task for the Triton Inference Server. The perf_client is included with the client examples which are available from several sources.

The perf_client generates inference requests to your model and measures the throughput and latency of those requests. To get representative results, the perf_client measures the throughput and latency over a time window, and then repeats the measurements until it gets stable values. By default the perf_client uses average latency to determine stability, but you can use the --percentile flag to stabilize results on the specified latency percentile instead. For example, if --percentile=95 is used the results will be stabilized using the 95th percentile request latency:

$ perf_client -m resnet50_netdef --percentile=95
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 5000 msec
  Stabilizing using p95 latency

Request concurrency: 1
  Client:
    Request count: 809
    Throughput: 161.8 infer/sec
    p50 latency: 6178 usec
    p90 latency: 6237 usec
    p95 latency: 6260 usec
    p99 latency: 6339 usec
    Avg HTTP time: 6153 usec (send/recv 72 usec + response wait 6081 usec)
  Server:
    Request count: 971
    Avg request latency: 4824 usec (overhead 10 usec + queue 39 usec + compute 4775 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, 161.8 infer/sec, latency 6260 usec

Request Concurrency

By default perf_client measures your model’s latency and throughput using the lowest possible load on the model. To do this perf_client sends one inference request to Triton and waits for the response. When that response is received, the perf_client immediately sends another request, and then repeats this process during the measurement windows. The number of outstanding inference requests is referred to as the request concurrency, and so by default perf_client uses a request concurrency of 1.

Using the --concurrency-range <start>:<end>:<step> option you can have perf_client collect data for a range of request concurrency levels. Use the --help option to see complete documentation for this and other options. For example, to see the latency and throughput of your model for request concurrency values from 1 to 4:

$ perf_client -m resnet50_netdef --concurrency-range 1:4
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 4 concurrent requests
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 804
    Throughput: 160.8 infer/sec
    Avg latency: 6207 usec (standard deviation 267 usec)
    p50 latency: 6212 usec
...
Request concurrency: 4
  Client:
    Request count: 1042
    Throughput: 208.4 infer/sec
    Avg latency: 19185 usec (standard deviation 105 usec)
    p50 latency: 19168 usec
    p90 latency: 19218 usec
    p95 latency: 19265 usec
    p99 latency: 19583 usec
    Avg HTTP time: 19156 usec (send/recv 79 usec + response wait 19077 usec)
  Server:
    Request count: 1250
    Avg request latency: 18099 usec (overhead 9 usec + queue 13314 usec + compute 4776 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, 160.8 infer/sec, latency 6207 usec
Concurrency: 2, 209.2 infer/sec, latency 9548 usec
Concurrency: 3, 207.8 infer/sec, latency 14423 usec
Concurrency: 4, 208.4 infer/sec, latency 19185 usec

Understanding The Output

For each request concurrency level perf_client reports latency and throughput as seen from the client (that is, as seen by perf_client) and also the average request latency on the server.

The server latency measures the total time from when the request is received at the server until the response is sent from the server. Because of the HTTP and GRPC libraries used to implement the server endpoints, total server latency is typically more accurate for HTTP requests as it measures time from first byte received until last byte sent. For both HTTP and GRPC the total server latency is broken-down into the following components:

  • queue: The average time spent in the inference scheduling queue by a request waiting for an instance of the model to become available.

  • compute: The average time spent performing the actual inference, including any time needed to copy data to/from the GPU.

The client latency time is broken-down further for HTTP and GRPC as follows:

  • HTTP: send/recv indicates the time on the client spent sending the request and receiving the response. response wait indicates time waiting for the response from the server.

  • GRPC: (un)marshal request/response indicates the time spent marshalling the request data into the GRPC protobuf and unmarshalling the response data from the GRPC protobuf. response wait indicates time writing the GRPC request to the network, waiting for the response, and reading the GRPC response from the network.

Use the verbose (-v) option to perf_client to see more output, including the stabilization passes run for each request concurrency level.

Visualizing Latency vs. Throughput

The perf_client provides the -f option to generate a file containing CSV output of the results:

$ perf_client -m resnet50_netdef --concurrency-range 1:4 -f perf.csv
$ cat perf.csv
Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency
1,163.6,69,1230,33,43,4719,5,9,6133,6191,6224,6415
2,208.6,180,1306,3299,43,4720,5,28,9482,9617,10746,10832
4,209.8,173,1268,12835,40,4705,4,27,19046,19133,19164,19290
3,210.2,175,1267,8052,40,4697,4,27,14259,14325,14350,14426

You can import the CSV file into a spreadsheet to help visualize the latency vs inferences/second tradeoff as well as see some components of the latency. Follow these steps:

  • Open this spreadsheet

  • Make a copy from the File menu “Make a copy…”

  • Open the copy

  • Select the A1 cell on the “Raw Data” tab

  • From the File menu select “Import…”

  • Select “Upload” and upload the file

  • Select “Replace data at selected cell” and then select the “Import data” button
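
If you prefer to script the visualization instead of using a spreadsheet, the following minimal sketch plots the same tradeoff directly from perf.csv. It assumes pandas and matplotlib are available in your client environment; the column names match the CSV header shown above.

import matplotlib.pyplot as plt
import pandas as pd

# Load the perf_client CSV output and order the rows by concurrency.
df = pd.read_csv("perf.csv").sort_values("Concurrency")

# Plot throughput against client p95 latency (converted to milliseconds).
fig, ax = plt.subplots()
ax.plot(df["p95 latency"] / 1000.0, df["Inferences/Second"], marker="o")
ax.set_xlabel("Client p95 latency (ms)")
ax.set_ylabel("Inferences/Second")
fig.savefig("perf.png")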

Input Data

Use the --help option to see complete documentation for all input data options. By default perf_client sends random data to all the inputs of your model. You can select a different input data mode with the --input-data option:

  • random: (default) Send random data for each input.

  • zero: Send zeros for each input.

  • directory path: A path to a directory containing a binary file for each input, named the same as the input. Each binary file must contain the data required for that input for a batch-1 request. Each file should contain the raw binary representation of the input in row-major order. A sketch of creating such a file appears after this list.

  • file path: A path to a JSON file containing data to be used with every inference request. See the “Real Input Data” section for further details. --input-data can be provided multiple times with different file paths to specify multiple JSON files.
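
As a sketch of the directory input mode, the following writes a raw row-major binary file for a hypothetical model input named IMAGE (the directory name, model name, input name, shape and datatype are all illustrative assumptions):

import os

import numpy as np

# One binary file per input, named the same as the input, containing the
# batch-1 data in raw row-major (C-order) form.
os.makedirs("input_data", exist_ok=True)
data = np.random.rand(3, 224, 224).astype(np.float32)
data.tofile(os.path.join("input_data", "IMAGE"))

The directory can then be passed to perf_client, for example:

$ perf_client -m mymodel --input-data input_data --shape IMAGE:3,224,224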

For tensors with STRING datatype there are additional options --string-length and --string-data that may be used in some cases (see --help for full documentation).

For models that support batching you can use the -b option to indicate the batch-size of the requests that perf_client should send. For models with variable-sized inputs you must provide the --shape argument so that perf_client knows what shape tensors to use. For example, for a model that has an input called IMAGE that has shape [ 3, N, M ], where N and M are variable-size dimensions, to tell perf_client to send batch-size 4 requests of shape [ 3, 224, 224 ]:

$ perf_client -m mymodel -b 4 --shape IMAGE:3,224,224

Real Input Data

The performance of some models is highly dependent on the data used. For such cases users can provide, in a JSON file, the data to be used with every inference request made by the client. The perf_client will use the provided data in a round-robin fashion when sending inference requests.

Each entry in the “data” array must specify all input tensors with the exact size expected by the model from a single batch. The following example describes data for a model with two inputs named INPUT0 and INPUT1, each with shape [4, 4] and data type INT32:

{
  "data" :
   [
      {
        "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
      },
      {
        "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
      },
      {
        "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
      },
      {
        "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
      }
      .
      .
      .
    ]
}

Note that each [4, 4] input tensor has been flattened into a row-major list of 16 values.
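
A minimal sketch of generating a data file of this form from numpy arrays (the file name and values are illustrative; flatten() produces the required row-major ordering):

import json

import numpy as np

tensor = np.ones((4, 4), dtype=np.int32)
entry = {
    "INPUT0": tensor.flatten().tolist(),  # row-major flattening of the [4, 4] tensor
    "INPUT1": tensor.flatten().tolist(),
}
with open("input_data.json", "w") as f:
    json.dump({"data": [entry, entry]}, f)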

Apart from specifying explicit tensors, users can also provide Base64-encoded binary data for the tensors. Each data object must list its data in row-major order. The following example shows how this can be achieved:

{
  "data" :
   [
      {
        "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="},
        "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="}
      },
      {
        "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="},
        "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="}
      },
      {
        "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="},
        "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="}
      },
      .
      .
      .
    ]
}
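
A minimal sketch of producing the {"b64": ...} form shown above, taking the tensor bytes in row-major order (the file name and values are illustrative):

import base64
import json

import numpy as np

tensor = np.ones((4, 4), dtype=np.int32)
# tobytes() returns the row-major bytes; Base64-encode them for the JSON file.
b64 = base64.b64encode(tensor.tobytes()).decode("utf-8")
entry = {"INPUT0": {"b64": b64}, "INPUT1": {"b64": b64}}
with open("input_data_b64.json", "w") as f:
    json.dump({"data": [entry]}, f)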

For sequence models, multiple data streams can be specified in the JSON file. Each sequence gets a data stream of its own, and the client ensures that the data from each stream is played back with the same correlation ID. The example below shows how to specify data for multiple streams for a sequence model with a single input named INPUT, shape [1] and data type STRING:

{
  "data" :
    [
      [
        {
          "INPUT" : ["1"]
        },
        {
          "INPUT" : ["2"]
        },
        {
          "INPUT" : ["3"]
        },
        {
          "INPUT" : ["4"]
        }
      ],
      [
        {
          "INPUT" : ["1"]
        },
        {
          "INPUT" : ["1"]
        },
        {
          "INPUT" : ["1"]
        }
      ],
      [
        {
          "INPUT" : ["1"]
        },
        {
          "INPUT" : ["1"]
        }
      ]
    ]
}

The above example describes three data streams with lengths 4, 3 and 2 respectively. The perf_client will therefore produce sequences of length 4, 3 and 2.

Users can also provide an optional “shape” field for the tensors. This is especially useful when profiling models with variable-sized input tensors. The specified shape values are treated as an override; the client still expects default input shapes to be provided as a command line option (see --shape) for variable-sized inputs. In the absence of a “shape” field, the provided defaults will be used. Below is an example JSON file for a model with a single input “INPUT”, shape [-1,-1] and data type INT32:

{
  "data" :
   [
      {
        "INPUT" :
              {
                  "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                  "shape": [2,8]
              }
      },
      {
        "INPUT" :
              {
                  "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                  "shape": [8,2]
              }
      },
      {
        "INPUT" :
              {
                  "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
              }
      },
      {
        "INPUT" :
              {
                  "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                  "shape": [4,4]
              }
      }
      .
      .
      .
    ]
}

Shared Memory

By default perf_client sends input tensor data and receives output tensor data over the network. You can instead instruct perf_client to use system shared memory or CUDA shared memory to communicate tensor data. By using these options you can model the performance that you can achieve by using shared memory in your application. Use --shared-memory=system to use system (CPU) shared memory or --shared-memory=cuda to use CUDA shared memory.
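
For example, to repeat the baseline resnet50_netdef measurement while exchanging tensor data through system shared memory:

$ perf_client -m resnet50_netdef --percentile=95 --concurrency-range 1:4 --shared-memory=system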

Communication Protocol

By default perf_client uses HTTP to communicate with Triton. The GRPC protocol can be specified with the -i option. If GRPC is selected the --streaming option can also be specified for GRPC streaming.
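
For example, to repeat the baseline resnet50_netdef measurement over GRPC (the protocol value grpc is assumed here; see the --help output for the accepted values):

$ perf_client -m resnet50_netdef --percentile=95 --concurrency-range 1:4 -i grpc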

Server Trace

Triton includes the capability to generate a detailed trace for individual inference requests. If you are building your own inference server you must use the -DTRITON_ENABLE_TRACING=ON option when configuring cmake.
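
For example, a build might be configured with something like the following (other cmake options and paths are elided):

$ cmake -DTRITON_ENABLE_TRACING=ON ...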

Tracing is enabled by command-line arguments when running the tritonserver executable. For example:

$ tritonserver --trace-file=/tmp/trace.json --trace-rate=100 --trace-level=MAX ...

The --trace-file option indicates where the trace output should be written. The --trace-rate option specifies the sampling rate. In this example every 100th inference request will be traced. The --trace-level option indicates the level of trace detail that should be collected. Use the --help option to get more information.

JSON Trace Output

The trace output is a JSON file with the following schema:

[
  {
    "model_name": $string,
    "model_version": $number,
    "id": $number
    "parent_id": $number,
    "timestamps": [
      { "name" : $string, "ns" : $number },
      ...
    ]
  },
  ...
]

Each trace indicates the model name and version of the inference request. Each trace is assigned a unique “id”. If the trace is from a model run as part of an ensemble the “parent_id” will indicate the “id” of the containing ensemble. Each trace will have one or more “timestamps” with each timestamp having a name and the timestamp in nanoseconds (“ns”). For example:

[
  {
    "model_name": "simple",
    "model_version": -1,
    "id":1,
    "timestamps" : [
      { "name": "http recv start", "ns": 2259961222771924 },
      { "name": "http recv end", "ns": 2259961222820985 },
      { "name": "request handler start", "ns": 2259961223164078 },
      { "name": "queue start", "ns": 2259961223182400 },
      { "name": "compute start", "ns": 2259961223232405 },
      { "name": "compute end", "ns": 2259961230206777 },
      { "name": "request handler end", "ns": 2259961230211887 },
      { "name": "http send start", "ns": 2259961230529606 },
      { "name": "http send end", "ns": 2259961230543930 } ]
   }
 ]
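
Because the trace output is plain JSON it is straightforward to post-process. The following minimal sketch reads a trace file like the one above and prints the interval between consecutive timestamps (the file name is illustrative; the trace summary tool described below produces a more complete report):

import json

with open("trace.json") as f:
    traces = json.load(f)

for trace in traces:
    print("%s (id %d)" % (trace["model_name"], trace["id"]))
    # Sort the timestamps and print the gap between consecutive events.
    ts = sorted(trace["timestamps"], key=lambda t: t["ns"])
    for prev, cur in zip(ts, ts[1:]):
        delta_us = (cur["ns"] - prev["ns"]) / 1000.0
        print("  %s -> %s: %.0fus" % (prev["name"], cur["name"], delta_us))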

Trace Summary Tool

An example trace summary tool can be used to summarize a set of traces collected from Triton. Basic usage is:

$ trace_summary.py <trace file>

This produces a summary report for all traces in the file. HTTP and GRPC inference requests are reported separately:

File: trace.json
Summary for simple (-1): trace count = 1
HTTP infer request (avg): 378us
      Receive (avg): 21us
      Send (avg): 7us
      Overhead (avg): 79us
      Handler (avg): 269us
              Overhead (avg): 11us
              Queue (avg): 15us
              Compute (avg): 242us
                      Input (avg): 18us
                      Infer (avg): 208us
                      Output (avg): 15us
Summary for simple (-1): trace count = 1
GRPC infer request (avg): 21441us
      Wait/Read (avg): 20923us
      Send (avg): 74us
      Overhead (avg): 46us
      Handler (avg): 395us
              Overhead (avg): 16us
              Queue (avg): 47us
              Compute (avg): 331us
                      Input (avg): 30us
                      Infer (avg): 286us
                      Output (avg): 14us

Use the -t option to get a summary for each trace in the file. This summary shows the time, in microseconds, between different points in the processing of an inference request. For example, the below output shows that it took 15us from the start of handling the request until the request was enqueued in the scheduling queue:

$ trace_summary.py -t <trace file>
...
simple (-1):
      grpc wait/read start
              26529us
      grpc wait/read end
              39us
      request handler start
              15us
      queue start
              20us
      compute start
              266us
      compute end
              4us
      request handler end
              19us
      grpc send start
              77us
      grpc send end
...

The meaning of the trace timestamps is:

  • GRPC Request Wait/Read: Collected only for inference requests that use the GRPC protocol. The time spent waiting for a request to arrive at the server and for that request to be read. Because the wait time is included, this is not a useful measure of how much time is spent reading a request from the network. Tracing an HTTP request will provide an accurate measure of the read time.

  • HTTP Request Receive: Collected only for inference requests that use the HTTP protocol. The time required to read the inference request from the network.

  • Send: The time required to send the inference response.

  • Overhead: Additional time required in the HTTP or GRPC endpoint to process the inference request and response.

  • Handler: The total time spent handling the inference request, not including the HTTP and GRPC request/response handling.

    • Queue: The time the inference request spent in the scheduling queue.

    • Compute: The time the inference request spent executing the actual inference. This time includes the time spent copying input and output tensors. If --trace-level=MAX then a breakdown of the compute time will be provided as follows:

      • Input: The time to copy input tensor data as required by the inference framework / backend. This includes the time to copy input tensor data to the GPU.

      • Infer: The time spent executing the model to perform the inference.

      • Output: The time to copy output tensor data as required by the inference framework / backend. This includes the time to copy output tensor data from the GPU.

    • Overhead: Additional time required for request handling not covered by Queue or Compute times.