..
  # Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
  #
  # Redistribution and use in source and binary forms, with or without
  # modification, are permitted provided that the following conditions
  # are met:
  #  * Redistributions of source code must retain the above copyright
  #    notice, this list of conditions and the following disclaimer.
  #  * Redistributions in binary form must reproduce the above copyright
  #    notice, this list of conditions and the following disclaimer in the
  #    documentation and/or other materials provided with the distribution.
  #  * Neither the name of NVIDIA CORPORATION nor the names of its
  #    contributors may be used to endorse or promote products derived
  #    from this software without specific prior written permission.
  #
  # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
  # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
  # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
  # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
  # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
  # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
  # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

.. _section-client-libraries-and-examples:

Client Libraries and Examples
=============================

The inference server *client libraries* make it easy to communicate with the
TensorRT Inference Server from your C++ or Python application. Using these
libraries you can send either HTTP or GRPC requests to the server to check
status or health and to make inference requests.

Several example applications show how to use the client libraries to perform
image classification and to test performance:

* C++ and Python versions of *image\_client*, an example application that
  uses the C++ or Python client library to execute image classification
  models on the TensorRT Inference Server.

* Python version of *grpc\_image\_client*, an example application that is
  functionally equivalent to *image\_client* but that uses GRPC generated
  client code to communicate with the inference server (instead of the
  client library).

* C++ version of *perf\_client*, an example application that issues a large
  number of concurrent requests to the inference server to measure latency
  and throughput for a given model. You can use this to experiment with
  different model configuration settings for your models.

.. build-client-begin-marker-do-not-remove

.. _section-building-the-client-libraries-and-examples:

Building the Client Libraries and Examples
------------------------------------------

The provided Dockerfile can be used to build just the client libraries and
examples. Issue the following command to build the C++ client library, the
C++ and Python examples, and a Python wheel file for the Python client
library::

  $ docker build -t tensorrtserver_clients --target trtserver_build --build-arg "BUILD_CLIENTS_ONLY=1" .

You can optionally add *-\\-build-arg "PYVER=<ver>"* to set the Python
version that you want the Python client library built for. Supported values
for *<ver>* are 2.6 and 3.5, with 3.5 being the default.
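For example, to build the clients for Python 2.6 instead of the default you
would add the *PYVER* build argument to the command shown above (a sketch
that simply combines the two documented build arguments)::

  $ docker build -t tensorrtserver_clients --target trtserver_build --build-arg "BUILD_CLIENTS_ONLY=1" --build-arg "PYVER=2.6" .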
After the build completes, the tensorrtserver_clients docker image will
contain the built client libraries and examples.

The easiest way to try the examples described in the following sections is
to run the client image with -\\-net=host so that the client examples can
access the inference server running in its own container (see
:ref:`section-running-the-inference-server` for more information about
running the inference server)::

  $ docker run -it --rm --net=host tensorrtserver_clients

In the client image you can find the example executables in
/opt/tensorrtserver/bin, and the Python wheel in /opt/tensorrtserver/pip.

If your host system is Ubuntu-16.04, an alternative to running the examples
within the tensorrtserver_clients container is to instead download the
client libraries and examples from the `GitHub release page `_
corresponding to the release you are interested in::

  $ mkdir tensorrtserver_clients
  $ cd tensorrtserver_clients
  $ wget https://github.com/NVIDIA/tensorrt-inference-server/archive/v0.11.0.clients.tar.gz
  $ tar xzf v0.11.0.clients.tar.gz

You can now find the client example binaries in bin/, the C++ libraries in
lib/, and the Python client examples and wheel file in python/.

To run the C++ examples you must install some dependencies on your
Ubuntu-16.04 host system::

  $ apt-get install curl libcurl3-dev libopencv-dev libopencv-core-dev

To run the Python examples you will need to additionally install the wheel
file and some other dependencies::

  $ apt-get install python3 python3-pip
  $ pip3 install --user --upgrade tensorrtserver-*.whl numpy pillow

.. build-client-end-marker-do-not-remove

.. _section-image_classification_example:

Image Classification Example Application
----------------------------------------

The image classification example that uses the C++ client API is available
at `src/clients/c++/image\_client.cc `_. The Python version of the image
classification client is available at
`src/clients/python/image\_client.py `_.

To use image\_client (or image\_client.py) you must first have a running
inference server that is serving one or more image classification models.
The image\_client application requires that the model have a single image
input and produce a single classification output. If you don't have a model
repository with image classification models, see
:ref:`section-example-model-repository` for instructions on how to create
one.

Follow the instructions in :ref:`section-running-the-inference-server` to
launch the server using the model repository.

Once the server is running you can use the image\_client application to send
inference requests to the server. You can specify a single image or a
directory holding images. Here we send a request for the resnet50_netdef
model from the :ref:`example model repository <section-example-model-repository>`
for an image from the `qa/images `_ directory::

  $ /opt/tensorrtserver/bin/image_client -m resnet50_netdef -s INCEPTION qa/images/mug.jpg
  Request 0, batch size 1
  Image '../qa/images/mug.jpg':
      504 (COFFEE MUG) = 0.723991

The Python version of the application accepts the same command-line
arguments::

  $ python3 /workspace/src/clients/python/image_client.py -m resnet50_netdef -s INCEPTION qa/images/mug.jpg
  Request 0, batch size 1
  Image '../qa/images/mug.jpg':
      504 (COFFEE MUG) = 0.778078556061

The image\_client and image\_client.py applications use the inference server
client library to talk to the server.
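For orientation, the following is a rough sketch of the kind of request that
image\_client.py makes through that library. It is not the actual example
code: the input and output tensor names and the zero-filled image array are
placeholders, so consult image\_client.py and :ref:`section-client-api` for
the real usage::

  import numpy as np
  from tensorrtserver.api import InferContext, ProtocolType

  # Connect to the server over HTTP (use "grpc" and port 8001 for GRPC).
  protocol = ProtocolType.from_str("http")
  ctx = InferContext("localhost:8000", protocol, "resnet50_netdef")

  # Placeholder for a real preprocessed image matching the model's input
  # shape and datatype.
  img = np.zeros((3, 224, 224), dtype=np.float32)

  # Request the top-3 classifications for a single-image batch. "input"
  # and "output" are placeholder tensor names; use the names from the
  # model configuration.
  results = ctx.run(
      {"input": [img]},
      {"output": (InferContext.ResultFormat.CLASS, 3)},
      batch_size=1)

  # For CLASS format each batch element is a list of
  # (index, value, label) tuples.
  for (idx, value, label) in results["output"][0]:
      print("{} ({}) = {}".format(idx, label, value))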
By default image\_client instructs the client library to use the HTTP
protocol to talk to the server, but you can use the GRPC protocol by
providing the \-i flag. You must also use the \-u flag to point at the GRPC
endpoint on the inference server::

  $ /opt/tensorrtserver/bin/image_client -i grpc -u localhost:8001 -m resnet50_netdef -s INCEPTION qa/images/mug.jpg
  Request 0, batch size 1
  Image '../qa/images/mug.jpg':
      504 (COFFEE MUG) = 0.723991

By default the client prints the most probable classification for the
image. Use the \-c flag to see more classifications::

  $ /opt/tensorrtserver/bin/image_client -m resnet50_netdef -s INCEPTION -c 3 qa/images/mug.jpg
  Request 0, batch size 1
  Image '../qa/images/mug.jpg':
      504 (COFFEE MUG) = 0.723991
      968 (CUP) = 0.270953
      967 (ESPRESSO) = 0.00115996

The \-b flag allows you to send a batch of images for inferencing. The
image\_client application will form the batch from the image or images that
you specified. If the batch is bigger than the number of images then
image\_client will just repeat the images to fill the batch::

  $ /opt/tensorrtserver/bin/image_client -m resnet50_netdef -s INCEPTION -c 3 -b 2 qa/images/mug.jpg
  Request 0, batch size 2
  Image '../qa/images/mug.jpg':
      504 (COFFEE MUG) = 0.778078556061
      968 (CUP) = 0.213262036443
      967 (ESPRESSO) = 0.00293014757335
  Image '../qa/images/mug.jpg':
      504 (COFFEE MUG) = 0.778078556061
      968 (CUP) = 0.213262036443
      967 (ESPRESSO) = 0.00293014757335

Provide a directory instead of a single image to perform inferencing on all
images in the directory::

  $ /opt/tensorrtserver/bin/image_client -m resnet50_netdef -s INCEPTION -c 3 -b 2 qa/images
  Request 0, batch size 2
  Image '../qa/images/car.jpg':
      817 (SPORTS CAR) = 0.836187
      511 (CONVERTIBLE) = 0.0708251
      751 (RACER) = 0.0597549
  Image '../qa/images/mug.jpg':
      504 (COFFEE MUG) = 0.723991
      968 (CUP) = 0.270953
      967 (ESPRESSO) = 0.00115996
  Request 1, batch size 2
  Image '../qa/images/vulture.jpeg':
      23 (VULTURE) = 0.992326
      8 (HEN) = 0.00231854
      84 (PEACOCK) = 0.00201471
  Image '../qa/images/car.jpg':
      817 (SPORTS CAR) = 0.836187
      511 (CONVERTIBLE) = 0.0708251
      751 (RACER) = 0.0597549

The grpc\_image\_client.py application available at
`src/clients/python/grpc\_image\_client.py `_ behaves the same as the
image\_client except that instead of using the inference server client
library it uses the GRPC generated client library to communicate with the
server.

Performance Example Application
-------------------------------

The perf\_client example application located at
`src/clients/c++/perf\_client.cc `_ uses the C++ client API to send
concurrent requests to the server to measure latency and inferences per
second under varying client loads.

To use perf\_client you must first have a running inference server that is
serving one or more models. The perf\_client application works with any type
of model by sending random data for all input tensors and by reading and
ignoring all output tensors. If you don't have a model repository, see
:ref:`section-example-model-repository` for instructions on how to create
one.

Follow the instructions in :ref:`section-running-the-inference-server` to
launch the inference server using the model repository.

The perf\_client application has two major modes. In the first mode you
specify how many concurrent clients you want to simulate and perf\_client
finds a stable latency and inferences/second for that level of concurrency.
Use the \-t flag to control concurrency and \-v to see verbose output.
The following example simulates four clients continuously sending requests
to the inference server::

  $ /opt/tensorrtserver/bin/perf_client -m resnet50_netdef -p3000 -t4 -v
  *** Measurement Settings ***
    Batch size: 1
    Measurement window: 3000 msec
    Request concurrency: 4
  Pass [1] throughput: 207 infer/sec. Avg latency: 19268 usec (std 910 usec)
  Pass [2] throughput: 206 infer/sec. Avg latency: 19362 usec (std 941 usec)
  Pass [3] throughput: 208 infer/sec. Avg latency: 19252 usec (std 841 usec)
  Client:
    Request count: 624
    Throughput: 208 infer/sec
    Avg latency: 19252 usec (standard deviation 841 usec)
    Avg HTTP time: 19224 usec (send 714 usec + response wait 18486 usec + receive 24 usec)
  Server:
    Request count: 749
    Avg request latency: 17886 usec (overhead 55 usec + queue 26 usec + compute 17805 usec)

In the second mode perf\_client will generate an inferences/second
vs. latency curve by increasing concurrency until a specific latency limit
or concurrency limit is reached. This mode is enabled with the \-d option,
using \-l to specify the latency limit and optionally \-c to specify a
maximum concurrency limit::

  $ /opt/tensorrtserver/bin/perf_client -m resnet50_netdef -p3000 -d -l50 -c 3
  *** Measurement Settings ***
    Batch size: 1
    Measurement window: 3000 msec
    Latency limit: 50 msec
    Concurrency limit: 3 concurrent requests
  Request concurrency: 1
    Client:
      Request count: 327
      Throughput: 109 infer/sec
      Avg latency: 9191 usec (standard deviation 822 usec)
      Avg HTTP time: 9188 usec (send/recv 1007 usec + response wait 8181 usec)
    Server:
      Request count: 391
      Avg request latency: 7661 usec (overhead 90 usec + queue 68 usec + compute 7503 usec)
  Request concurrency: 2
    Client:
      Request count: 521
      Throughput: 173 infer/sec
      Avg latency: 11523 usec (standard deviation 616 usec)
      Avg HTTP time: 11448 usec (send/recv 711 usec + response wait 10737 usec)
    Server:
      Request count: 629
      Avg request latency: 10018 usec (overhead 70 usec + queue 41 usec + compute 9907 usec)
  Request concurrency: 3
    Client:
      Request count: 580
      Throughput: 193 infer/sec
      Avg latency: 15518 usec (standard deviation 635 usec)
      Avg HTTP time: 15487 usec (send/recv 779 usec + response wait 14708 usec)
    Server:
      Request count: 697
      Avg request latency: 14083 usec (overhead 59 usec + queue 30 usec + compute 13994 usec)
  Inferences/Second vs. Client Average Batch Latency
  Concurrency: 1, 109 infer/sec, latency 9191 usec
  Concurrency: 2, 173 infer/sec, latency 11523 usec
  Concurrency: 3, 193 infer/sec, latency 15518 usec

Use the \-f flag to generate a file containing CSV output of the results::

  $ /opt/tensorrtserver/bin/perf_client -m resnet50_netdef -p3000 -d -l50 -c 3 -f perf.csv

You can then import the CSV file into a spreadsheet to help visualize the
latency vs. inferences/second tradeoff as well as see some components of the
latency. Follow these steps:

- Open `this spreadsheet `_
- Make a copy from the File menu "Make a copy..."
- Open the copy
- Select the A2 cell
- From the File menu select "Import..."
- Select "Upload" and upload the file
- Select "Replace data at selected cell" and then select the "Import data"
  button

.. _section-client-api:

Client API
----------

The C++ client API exposes a class-based interface for querying server and
model status and for performing inference. The commented interface is
available at `src/clients/c++/request.h `_ and in the API Reference.

The Python client API provides similar capabilities to the C++ API. The
commented interface is available at
`src/clients/python/\_\_init\_\_.py `_ and in the API Reference.
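As a quick orientation, below is a minimal Python sketch of checking server
liveness and readiness and querying model status. The class and method names
are assumed to match the Python client library shipped with this release;
consult src/clients/python/\_\_init\_\_.py and the API Reference for the
authoritative interface::

  from tensorrtserver.api import ProtocolType, ServerHealthContext, ServerStatusContext

  url = "localhost:8000"
  protocol = ProtocolType.HTTP  # use ProtocolType.GRPC and port 8001 for GRPC

  # Liveness and readiness checks.
  health_ctx = ServerHealthContext(url, protocol)
  print("live: {}".format(health_ctx.is_live()))
  print("ready: {}".format(health_ctx.is_ready()))

  # Status for a single model; omit the model name to get status for
  # every model loaded by the server.
  status_ctx = ServerStatusContext(url, protocol, "resnet50_netdef")
  print(status_ctx.get_server_status())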
A simple C++ example application at `src/clients/c++/simple\_client.cc `_
and a Python version at `src/clients/python/simple\_client.py `_ demonstrate
basic client API usage.

To run the C++ version of the simple example, first build as described in
:ref:`section-building-the-client-libraries-and-examples` and then::

  $ /opt/tensorrtserver/bin/simple_client
  0 + 1 = 1
  0 - 1 = -1
  1 + 1 = 2
  1 - 1 = 0
  2 + 1 = 3
  2 - 1 = 1
  3 + 1 = 4
  3 - 1 = 2
  4 + 1 = 5
  4 - 1 = 3
  5 + 1 = 6
  5 - 1 = 4
  6 + 1 = 7
  6 - 1 = 5
  7 + 1 = 8
  7 - 1 = 6
  8 + 1 = 9
  8 - 1 = 7
  9 + 1 = 10
  9 - 1 = 8
  10 + 1 = 11
  10 - 1 = 9
  11 + 1 = 12
  11 - 1 = 10
  12 + 1 = 13
  12 - 1 = 11
  13 + 1 = 14
  13 - 1 = 12
  14 + 1 = 15
  14 - 1 = 13
  15 + 1 = 16
  15 - 1 = 14

To run the Python version of the simple example, first build as described in
:ref:`section-building-the-client-libraries-and-examples` and install the
tensorrtserver whl, then::

  $ python3 /workspace/src/clients/python/simple_client.py

String Datatype
^^^^^^^^^^^^^^^

Some frameworks support tensors where each element in the tensor is a string
(see :ref:`section-datatypes` for information on supported datatypes). For
the most part, the Client API is identical for string and non-string
tensors. One exception is that in the C++ API a string input tensor must be
initialized with SetFromString() instead of SetRaw().

String tensors are demonstrated in the C++ example application at
`src/clients/c++/simple\_string\_client.cc `_ and a Python version at
`src/clients/python/simple\_string\_client.py `_.
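For reference, here is a rough Python sketch of sending a string input
tensor. It assumes that the Python API accepts string tensors as numpy
arrays with dtype=object (one Python string per element), and the model name
and tensor names below are placeholders; simple\_string\_client.py is the
authoritative example::

  import numpy as np
  from tensorrtserver.api import InferContext, ProtocolType

  # Placeholder model name; the model is assumed to take one STRING input
  # of 16 elements and produce one STRING output.
  ctx = InferContext("localhost:8000", ProtocolType.HTTP, "my_string_model")

  # Build a 1-D string tensor of 16 elements from integers for illustration.
  input_data = np.array([str(x) for x in range(16)], dtype=object)

  results = ctx.run(
      {"INPUT0": [input_data]},                    # placeholder input name
      {"OUTPUT0": InferContext.ResultFormat.RAW},  # placeholder output name
      batch_size=1)
  print(results["OUTPUT0"][0])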