The Triton Inference Server is available as buildable source code, but the easiest way to install and run Triton is to use the pre-built Docker image available from the NVIDIA GPU Cloud (NGC).
Launching and maintaining Triton Inference Server revolves around the use of building model repositories. This tutorial will cover:
Creating a Model Repository
Send an Inference Request
Create A Model Repository#
The model repository is the directory where you place the models that you want Triton to serve. An example model repository is included in the docs/examples/model_repository. Before using the repository, you must fetch any missing model definition files from their public model zoos via the provided script.
$ cd docs/examples $ ./fetch_models.sh
Triton is optimized to provide the best inferencing performance by using GPUs, but it can also work on CPU-only systems. In both cases you can use the same Triton Docker image.
Run on System with GPUs#
Use the following command to run Triton with the example model repository you just created. The NVIDIA Container Toolkit must be installed for Docker to recognize the GPU(s). The –gpus=1 flag indicates that 1 system GPU should be made available to Triton for inferencing.
$ docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models
Where <xx.yy> is the version of Triton that you want to use (and pulled above). After you start Triton you will see output on the console showing the server starting up and loading the model. When you see output like the following, Triton is ready to accept inference requests.
+----------------------+---------+--------+ | Model | Version | Status | +----------------------+---------+--------+ | <model_name> | <v> | READY | | .. | . | .. | | .. | . | .. | +----------------------+---------+--------+ ... ... ... I1002 21:58:57.891440 62 grpc_server.cc:3914] Started GRPCInferenceService at 0.0.0.0:8001 I1002 21:58:57.893177 62 http_server.cc:2717] Started HTTPService at 0.0.0.0:8000 I1002 21:58:57.935518 62 http_server.cc:2736] Started Metrics Service at 0.0.0.0:8002
All the models should show “READY” status to indicate that they loaded correctly. If a model fails to load the status will report the failure and a reason for the failure. If your model is not displayed in the table check the path to the model repository and your CUDA drivers.
Run on CPU-Only System#
On a system without GPUs, Triton should be run without using the –gpus flag to Docker, but is otherwise identical to what is described above.
$ docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models
Because the –gpus flag is not used, a GPU is not available and Triton will therefore be unable to load any model configuration that requires a GPU.
Verify Triton Is Running Correctly#
Use Triton’s ready endpoint to verify that the server and the models are ready for inference. From the host system use curl to access the HTTP endpoint that indicates server status.
$ curl -v localhost:8000/v2/health/ready ... < HTTP/1.1 200 OK < Content-Length: 0 < Content-Type: text/plain
The HTTP request returns status 200 if Triton is ready and non-200 if it is not ready.
Send an Inference Request#
Use docker pull to get the client libraries and examples image from NGC.
$ docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
Where <xx.yy> is the version that you want to pull. Run the client image.
$ docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
From within the nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk image, run the example image-client application to perform image classification using the example densenet_onnx model.
To send a request for the densenet_onnx model use an image from the /workspace/images directory. In this case we ask for the top 3 classifications.
$ /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg Request 0, batch size 1 Image '/workspace/images/mug.jpg': 15.346230 (504) = COFFEE MUG 13.224326 (968) = CUP 10.422965 (505) = COFFEEPOT