Client Examples

After you have Triton running, you can send inference and other requests to it from your client application using the HTTP/REST or GRPC protocols.

To simplify communication with Triton, the Triton project provides C++ and Python client libraries, and several example applications that show how to use these libraries.

  • C++ and Python versions of image_client, an example application that uses the C++ or Python client library to execute image classification models on Triton. See Image Classification Example Application.

  • Several simple C++ examples show how to use the C++ library to communicate with Triton to perform inferencing and other tasks. The C++ examples demonstrating the HTTP/REST client are named with a simple_http_ prefix and the examples demonstrating the GRPC client are named with a simple_grpc_ prefix.

  • Several simple Python examples show how to use the Python library to communicate with Triton to perform inferencing and other tasks. The Python examples demonstrating the HTTP/REST client are named with a simple_http_ prefix and the examples demonstrating the GRPC client are named with a simple_grpc_ prefix.

  • A couple of Python examples communicate with Triton using a Python GRPC API generated by the protoc compiler. grpc_client.py is a simple example that shows basic API usage. grpc_image_client.py is functionally equivalent to image_client but uses a generated GRPC client stub to communicate with Triton (a minimal sketch of this approach follows this list).

  • The protoc compiler can generate a GRPC API in a large number of programming languages. Examples for languages other than Python can be found in subdirectories of src/clients.
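
For illustration, the following is a minimal sketch of the generated-stub approach that grpc_client.py takes, assuming Python stubs have already been generated from Triton's grpc_service.proto with the protoc compiler. The module names grpc_service_pb2 and grpc_service_pb2_grpc are the protoc defaults for that file, and the server address and the ServerLive health check are only illustrative:

import grpc
import grpc_service_pb2
import grpc_service_pb2_grpc

# Open an insecure channel to Triton's default GRPC port and create a stub
# for the inference service defined in grpc_service.proto.
channel = grpc.insecure_channel("localhost:8001")
stub = grpc_service_pb2_grpc.GRPCInferenceServiceStub(channel)

# A minimal liveness check, similar in spirit to grpc_client.py.
response = stub.ServerLive(grpc_service_pb2.ServerLiveRequest())
print("Server live:", response.live)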

Getting the Client Examples

You can build the examples using the provided Dockerfile.client or the CMake support, download pre-built examples from GitHub, or download a pre-built Docker image containing the client libraries from NVIDIA GPU Cloud (NGC).

Build Using Dockerfile

To build the examples using Docker follow the description in Build Using Dockerfile.

After the build completes, the tritonserver_client Docker image contains the built client examples and is also configured with all the dependencies required to run those examples within the container. The easiest way to try the examples described in the following sections is to run the client image with --net=host so that the client examples can access Triton running in its own container. To use system shared memory you need to run the client and server images with --ipc=host so that Triton can access the system shared memory in the client container. Additionally, to create system shared memory regions larger than 64MB, the --shm-size flag is needed when running the client image. To use CUDA shared memory you need to pass the appropriate Docker arguments when running the client image (see Running Triton for more information about running Triton):

$ docker run -it --rm --net=host tritonserver_client

In the tritonserver_client image you can find the example executables in /workspace/install/bin, and the Python examples in /workspace/install/python.

Build Using CMake

To build the examples using CMake follow the description in Build Using CMake.

Ubuntu 18.04

When the build completes, the examples can be found in client/install. To use the examples, you need to add the path to the client library to the LD_LIBRARY_PATH environment variable; by default this is /path/to/tritonserver/repo/build/client/install/lib. In addition, you also need to install the client Python packages and other packages required by the examples:

$ pip3 install --upgrade client/install/python/triton*.whl numpy pillow

Windows 10

When the build completes, the examples can be found in client/install. The C++ client examples are not generated because they have not yet been ported to Windows. However, you can use the Python examples to verify that the build succeeded. To use the Python examples, you need to install the Python wheels:

> pip3 install --upgrade client/install/python/triton*.whl numpy pillow

Download From GitHub

To download the examples follow the description in Download From GitHub.

To use the C++ examples you must install some dependencies. For Ubuntu 18.04:

$ apt-get update
$ apt-get install curl libcurl4-openssl-dev

The Python examples require that you additionally install the wheel files and some other dependencies:

$ apt-get install python3 python3-pip
$ pip3 install --user --upgrade python/triton*.whl numpy pillow

The C++ image_client example uses OpenCV for image manipulation so for that example you must install the following:

$ apt-get install libopencv-dev libopencv-core-dev

Download Docker Image From NGC

To download the Docker image follow the description in Download Docker Image From NGC.

The Docker image contains the built client examples and is also configured with all the dependencies required to run those examples within the container. The easiest way to try the examples described in the following sections is to run the client image with --net=host so that the client examples can access Triton running in its own container. To use system shared memory you need to run the client and server images with --ipc=host so that Triton can access the system shared memory in the client container. Additionally, to create system shared memory regions larger than 64MB, the --shm-size flag is needed when running the client image. To use CUDA shared memory you need to pass the appropriate Docker arguments when running the client image (see Running Triton for more information about running Triton):

$ docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:<xx.yy>-py3-clientsdk

In the image you can find the example executables in /workspace/install/bin, and the Python examples in /workspace/install/python.

Simple Example Applications

This section describes several of the simple example applications and the features that they illustrate.

String Datatype

Some frameworks support tensors where each element in the tensor is a string (see Datatypes for information on supported datatypes).

String tensors are demonstrated in the C++ example applications simple_http_string_infer_client.cc and simple_grpc_string_infer_client.cc, and in the Python example applications simple_http_string_infer_client.py and simple_grpc_string_infer_client.py.
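
As a rough sketch of what those examples do, the snippet below sends a string tensor using the Python client library. The tritonclient module layout assumed here may differ from the wheels that accompany this release (older wheels expose the same classes from tritonhttpclient), and the model name and tensor names are placeholders:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# String tensors use the BYTES datatype and are passed as a numpy array of
# bytes objects. "simple_string", "INPUT0" and "OUTPUT0" are placeholders.
data = np.array([[b"hello"] * 16], dtype=np.object_)
inp = httpclient.InferInput("INPUT0", [1, 16], "BYTES")
inp.set_data_from_numpy(data)

result = client.infer(
    model_name="simple_string",
    inputs=[inp],
    outputs=[httpclient.InferRequestedOutput("OUTPUT0")])
print(result.as_numpy("OUTPUT0"))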

System Shared Memory

Using system shared memory to communicate tensors between the client library and Triton can significantly improve performance in some cases. Its use is demonstrated in the C++ example applications simple_http_shm_client.cc and simple_grpc_shm_client.cc, and in the Python example applications simple_http_shm_client.py and simple_grpc_shm_client.py.

Because Python does not have a standard way of allocating and accessing shared memory, a simple system shared memory module is provided as an example that can be used with the Python client library to create, set, and destroy system shared memory regions.
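
The outline below shows the basic flow those examples follow: create a system shared memory region, register it with Triton, and reference it from an input instead of sending the tensor bytes in the request. The module and method names are taken from the current tritonclient package and may differ in older wheels; the model and tensor names are placeholders:

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm
from tritonclient.utils import np_to_triton_dtype

client = httpclient.InferenceServerClient(url="localhost:8000")

# Create a system shared memory region and copy the input data into it.
input_data = np.arange(16, dtype=np.int32).reshape(1, 16)
byte_size = input_data.nbytes
shm_handle = shm.create_shared_memory_region("input_region", "/input_simple", byte_size)
shm.set_shared_memory_region(shm_handle, [input_data])

# Register the region with Triton, then reference it from the input.
client.register_system_shared_memory("input_region", "/input_simple", byte_size)
inp = httpclient.InferInput("INPUT0", [1, 16], np_to_triton_dtype(input_data.dtype))
inp.set_shared_memory("input_region", byte_size)

result = client.infer(model_name="simple", inputs=[inp])

# Clean up the registration and the region itself.
client.unregister_system_shared_memory("input_region")
shm.destroy_shared_memory_region(shm_handle)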

CUDA Shared Memory

Using CUDA shared memory to communicate tensors between the client library and Triton can significantly improve performance in some cases. Its use is demonstrated in the C++ example applications simple_http_cudashm_client.cc and simple_grpc_cudashm_client.cc, and in the Python example applications simple_http_cudashm_client.py and simple_grpc_cudashm_client.py.

Because Python does not have a standard way of allocating and accessing shared memory, a simple CUDA shared memory module is provided as an example that can be used with the Python client library to create, set, and destroy CUDA shared memory regions.
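
The flow mirrors the system shared memory case, except that the region lives in GPU memory and is registered with Triton using its raw CUDA IPC handle. Again, a minimal sketch assuming the current tritonclient module names and placeholder model and tensor names:

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm
from tritonclient.utils import np_to_triton_dtype

client = httpclient.InferenceServerClient(url="localhost:8000")

# Allocate a CUDA shared memory region on GPU 0 and copy the input into it.
input_data = np.arange(16, dtype=np.int32).reshape(1, 16)
byte_size = input_data.nbytes
cuda_handle = cudashm.create_shared_memory_region("input_region", byte_size, 0)
cudashm.set_shared_memory_region(cuda_handle, [input_data])

# Register the region with Triton using its raw CUDA IPC handle.
client.register_cuda_shared_memory(
    "input_region", cudashm.get_raw_handle(cuda_handle), 0, byte_size)

inp = httpclient.InferInput("INPUT0", [1, 16], np_to_triton_dtype(input_data.dtype))
inp.set_shared_memory("input_region", byte_size)
result = client.infer(model_name="simple", inputs=[inp])

# Clean up the registration and the region itself.
client.unregister_cuda_shared_memory("input_region")
cudashm.destroy_shared_memory_region(cuda_handle)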

Client API for Stateful Models

When performing inference using a stateful model, a client must identify which inference requests belong to the same sequence and also when a sequence starts and ends.

Each sequence is identified with a sequence ID that is provided when an inference request is made. It is up to the client to create a unique sequence ID. For each sequence, the first inference request should be marked as the start of the sequence and the last inference request should be marked as the end of the sequence.

The use of sequence IDs and start and end flags is demonstrated in the C++ example applications simple_http_sequence_stream_infer_client.cc and simple_grpc_sequence_stream_infer_client.cc, and in the Python example applications simple_http_sequence_stream_infer_client.py and simple_grpc_sequence_stream_infer_client.py.
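
A minimal sketch of that pattern using the GRPC streaming API of the Python client library follows. The client picks the sequence ID, marks the first request with sequence_start and the last with sequence_end; the model name, input name, and shape are placeholders, and the module names assume the current tritonclient package:

import queue

import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # Stream callback: collect responses (or errors) as they arrive.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=callback)

sequence_id = 1000  # chosen by the client, unique among concurrent sequences
values = [1, 2, 3]

for i, value in enumerate(values):
    inp = grpcclient.InferInput("INPUT", [1, 1], "INT32")
    inp.set_data_from_numpy(np.full((1, 1), value, dtype=np.int32))
    client.async_stream_infer(
        model_name="simple_sequence",
        inputs=[inp],
        sequence_id=sequence_id,
        sequence_start=(i == 0),              # first request in the sequence
        sequence_end=(i == len(values) - 1))  # last request in the sequence

client.stop_stream()
while not responses.empty():
    print(responses.get())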

Image Classification Example Application

The image classification example that uses the C++ client API is available at src/clients/c++/examples/image_client.cc. The Python version of the image classification client is available at src/clients/python/examples/image_client.py.

To use image_client (or image_client.py) you must first have a running Triton that is serving one or more image classification models. The image_client application requires that the model have a single image input and produce a single classification output. If you don’t have a model repository with image classification models see Example Model Repository for instructions on how to create one.

Follow the instructions in Running Triton to launch Triton using the model repository. Once Triton is running you can use the image_client application to send inference requests. You can specify a single image or a directory holding images. Here we send a request for the resnet50_netdef model from the example model repository for an image from the qa/images directory:

$ image_client -m resnet50_netdef -s INCEPTION qa/images/mug.jpg
Request 0, batch size 1
Image '../qa/images/mug.jpg':
    504 (COFFEE MUG) = 0.723991

The Python version of the application accepts the same command-line arguments:

$ python image_client.py -m resnet50_netdef -s INCEPTION qa/images/mug.jpg
Request 0, batch size 1
Image '../qa/images/mug.jpg':
    504 (COFFEE MUG) = 0.778078556061

The image_client and image_client.py applications use the client library to talk to Triton. By default image_client instructs the client library to use HTTP/REST protocol, but you can use the GRPC protocol by providing the -i flag. You must also use the -u flag to point at the GRPC endpoint on Triton:

$ image_client -i grpc -u localhost:8001 -m resnet50_netdef -s INCEPTION qa/images/mug.jpg
Request 0, batch size 1
Image '../qa/images/mug.jpg':
    504 (COFFEE MUG) = 0.723991

By default the client prints the most probable classification for the image. Use the -c flag to see more classifications:

$ image_client -m resnet50_netdef -s INCEPTION -c 3 qa/images/mug.jpg
Request 0, batch size 1
Image '../qa/images/mug.jpg':
    504 (COFFEE MUG) = 0.723991
    968 (CUP) = 0.270953
    967 (ESPRESSO) = 0.00115996

The -b flag allows you to send a batch of images for inferencing. The image_client application will form the batch from the image or images that you specified. If the batch is bigger than the number of images then image_client will just repeat the images to fill the batch:

$ image_client -m resnet50_netdef -s INCEPTION -c 3 -b 2 qa/images/mug.jpg
Request 0, batch size 2
Image '../qa/images/mug.jpg':
    504 (COFFEE MUG) = 0.778078556061
    968 (CUP) = 0.213262036443
    967 (ESPRESSO) = 0.00293014757335
Image '../qa/images/mug.jpg':
    504 (COFFEE MUG) = 0.778078556061
    968 (CUP) = 0.213262036443
    967 (ESPRESSO) = 0.00293014757335

Provide a directory instead of a single image to perform inferencing on all images in the directory:

$ image_client -m resnet50_netdef -s INCEPTION -c 3 -b 2 qa/images
Request 0, batch size 2
Image '../qa/images/car.jpg':
    817 (SPORTS CAR) = 0.836187
    511 (CONVERTIBLE) = 0.0708251
    751 (RACER) = 0.0597549
Image '../qa/images/mug.jpg':
    504 (COFFEE MUG) = 0.723991
    968 (CUP) = 0.270953
    967 (ESPRESSO) = 0.00115996
Request 1, batch size 2
Image '../qa/images/vulture.jpeg':
    23 (VULTURE) = 0.992326
    8 (HEN) = 0.00231854
    84 (PEACOCK) = 0.00201471
Image '../qa/images/car.jpg':
    817 (SPORTS CAR) = 0.836187
    511 (CONVERTIBLE) = 0.0708251
    751 (RACER) = 0.0597549

The grpc_image_client.py application behaves the same as image_client except that instead of using the client library it uses the GRPC-generated library to communicate with Triton.

Ensemble Image Classification Example Application

In comparison to the image classification example above, this example uses an ensemble of an image-preprocessing model implemented as a custom backend and a Caffe2 ResNet50 model. This ensemble allows you to send the raw image binaries in the request and receive classification results without preprocessing the images on the client. The ensemble image classification example that uses the C++ client API is available at src/clients/c++/examples/ensemble_image_client.cc. The Python version of the ensemble image classification client is available at src/clients/python/examples/ensemble_image_client.py.

To use ensemble_image_client (or ensemble_image_client.py) you must first have a running Triton that is serving the “preprocess_resnet50_ensemble” model and the models it depends on. The models are provided in an example ensemble model repository. See Example Model Repository for instructions on how to create one.

Follow the instructions in Running Triton to launch Triton using the ensemble model repository. Once Triton is running you can use the ensemble_image_client application to send inference requests. You can specify a single image or a directory holding images. Here we send a request for the ensemble from the example ensemble model repository for an image from the qa/images directory:

$ ensemble_image_client qa/images/mug.jpg
Image 'qa/images/mug.jpg':
    504 (COFFEE MUG) = 0.723991

The Python version of the application accepts the same command-line arguments:

$ python ensemble_image_client.py qa/images/mug.jpg
Image 'qa/images/mug.jpg':
    504 (COFFEE MUG) = 0.778078556061

Similar to image_client, by default ensemble_image_client instructs the client library to use HTTP protocol to talk to Triton, but you can use GRPC protocol by providing the -i flag. You must also use the -u flag to point at the GRPC endpoint on Triton:

$ ensemble_image_client -i grpc -u localhost:8001 qa/images/mug.jpg
Image 'qa/images/mug.jpg':
    504 (COFFEE MUG) = 0.723991

By default the client prints the most probable classification for the image. Use the -c flag to see more classifications:

$ ensemble_image_client -c 3 qa/images/mug.jpg
Image 'qa/images/mug.jpg':
    504 (COFFEE MUG) = 0.723991
    968 (CUP) = 0.270953
    967 (ESPRESSO) = 0.00115996

Provide a directory instead of a single image to perform inferencing on all images in the directory. If the number of images exceeds the maximum batch size of the ensemble, only the images within the maximum batch size will be sent:

$ ensemble_image_client -c 3 qa/images
Image 'qa/images/car.jpg':
    817 (SPORTS CAR) = 0.836187
    511 (CONVERTIBLE) = 0.0708251
    751 (RACER) = 0.0597549
Image 'qa/images/mug.jpg':
    504 (COFFEE MUG) = 0.723991
    968 (CUP) = 0.270953
    967 (ESPRESSO) = 0.00115996
Image 'qa/images/vulture.jpeg':
    23 (VULTURE) = 0.992326
    8 (HEN) = 0.00231854
    84 (PEACOCK) = 0.00201471