Triton Inference Server Release 19.05

The TensorRT Inference Server container image, release 19.05, is available on NGC and is open source on GitHub.

Contents of the Triton Inference Server Container

The TensorRT Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tensorrtserver.

Driver Requirements

Release 19.05 is based on CUDA 10.1 Update 1, which requires NVIDIA driver release 418.xx. However, if you are running on a Tesla GPU (Tesla V100, Tesla P4, Tesla P40, or Tesla P100), you may use NVIDIA driver release 384.111+ or 410. The CUDA driver's compatibility package supports only particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

GPU Requirements

Release 19.05 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the Pascal, Volta, and Turing families. For a list of GPUs with each compute capability, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
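
The minimum stated above can be expressed as a simple version comparison. The following is a minimal sketch, not part of the inference server: the function name is_supported and the constant MIN_COMPUTE_CAPABILITY are hypothetical, and the capability string is assumed to have the form "major.minor".

```python
# Minimum CUDA compute capability stated in these release notes.
MIN_COMPUTE_CAPABILITY = (6, 0)


def is_supported(compute_capability: str) -> bool:
    """Return True if a 'major.minor' capability string meets the minimum.

    Hypothetical helper for illustration only.
    """
    major, minor = (int(part) for part in compute_capability.split("."))
    # Tuples compare element by element, so (7, 5) >= (6, 0) and (5, 2) < (6, 0).
    return (major, minor) >= MIN_COMPUTE_CAPABILITY


# Example values: Tesla P100 (Pascal) is 6.0, Tesla T4 (Turing) is 7.5,
# and Tesla M60 (Maxwell) is 5.2.
for capability in ("6.0", "7.5", "5.2"):
    print(capability, is_supported(capability))
```

Comparing (major, minor) tuples avoids the pitfalls of comparing the versions as floating-point numbers or as strings.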

Key Features and Enhancements

This Inference Server release includes the following key features and enhancements.
  • The inference server container image version 19.05 is based on NVIDIA TensorRT Inference Server 1.2.0, TensorFlow 1.13.1, and Caffe2 0.8.2.
  • Latest version of NVIDIA CUDA 10.1 Update 1 including cuBLAS 10.1 Update 1
  • Latest version of NVIDIA cuDNN 7.6.0
  • Latest version of TensorRT 5.1.5
  • Ensembling is now available. An ensemble represents a pipeline of one or more models and the connection of input and output tensors between those models. A single inference request to an ensemble will trigger the execution of the entire pipeline.
  • Added a Helm chart that deploys a single TensorRT Inference Server into a Kubernetes cluster.
  • The client Makefile now supports building for both Ubuntu 18.04 and Ubuntu 16.04. The Python wheel produced from the build is now compatible with both Python 2 and Python 3.
  • The perf_client application now has a --percentile flag that reports latency at the given percentile instead of the average latency (which remains the default). For example, using --percentile=99 causes perf_client to report the 99th percentile latency.
  • The perf_client application now has a -z option to use zero-valued input tensors instead of random values.
  • Improved error reporting of incorrect input/output tensor names for TensorRT models.
  • Added the --allow-gpu-metrics option to enable or disable reporting of GPU metrics.
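
To illustrate the ensembling concept described in the list above, the following toy sketch chains two stand-in "models" by mapping their input and output tensor names onto shared ensemble tensor names. It is an illustration of the idea only and does not use the inference server's configuration format or API. All names (preprocess, classify, ENSEMBLE_STEPS, run_ensemble, and the tensor names) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Tensors = Dict[str, List[float]]
Model = Callable[[Tensors], Tensors]
# A step is (model, input map, output map). Each map goes from the
# model's own tensor name to the ensemble-level tensor name.
Step = Tuple[Model, Dict[str, str], Dict[str, str]]


def preprocess(inputs: Tensors) -> Tensors:
    """Stand-in model: scale raw values into [0, 1]."""
    return {"scaled": [x / 255.0 for x in inputs["raw"]]}


def classify(inputs: Tensors) -> Tensors:
    """Stand-in model: reduce the features to a single mean score."""
    features = inputs["features"]
    return {"score": [sum(features) / len(features)]}


ENSEMBLE_STEPS: List[Step] = [
    (preprocess, {"raw": "IMAGE"}, {"scaled": "preprocessed"}),
    (classify, {"features": "preprocessed"}, {"score": "SCORE"}),
]


def run_ensemble(steps: List[Step], request: Tensors,
                 output_names: List[str]) -> Tensors:
    """Run every step in order for a single request."""
    tensors = dict(request)  # pool of ensemble-level tensors
    for model, input_map, output_map in steps:
        # Connect ensemble tensors to the model's input names.
        model_inputs = {name: tensors[src] for name, src in input_map.items()}
        model_outputs = model(model_inputs)
        # Publish the model's outputs under ensemble tensor names.
        for name, dst in output_map.items():
            tensors[dst] = model_outputs[name]
    return {name: tensors[name] for name in output_names}


result = run_ensemble(ENSEMBLE_STEPS, {"IMAGE": [0.0, 255.0]}, ["SCORE"])
print(result)
```

The caller sends one request containing IMAGE and receives SCORE. The intermediate tensor "preprocessed" never leaves the pipeline, which is the point of the single-request behavior described above.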
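
The difference between an average latency and a percentile latency, as reported with the --percentile flag described above, can be sketched as follows. This is not perf_client's implementation: it assumes the nearest-rank definition of a percentile, and the function name percentile_latency and the sample values are hypothetical.

```python
import math
from typing import Sequence


def percentile_latency(latencies_us: Sequence[float], percentile: int) -> float:
    """Return the nearest-rank percentile of a set of request latencies.

    Hypothetical helper for illustration; the nearest-rank method is an
    assumption, not a statement about how perf_client computes its result.
    """
    if not latencies_us:
        raise ValueError("at least one latency sample is required")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in the range (0, 100]")
    ordered = sorted(latencies_us)
    # Nearest rank: the smallest sample such that at least `percentile`
    # percent of all samples are less than or equal to it.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative latencies in microseconds, with one slow outlier.
samples = [10, 12, 11, 13, 250, 12, 11, 10, 14, 12]
average = sum(samples) / len(samples)
print("average:", average)
print("p50:", percentile_latency(samples, 50))
print("p99:", percentile_latency(samples, 99))
```

With these samples a single slow request pulls the average well above the typical latency, while the 99th percentile exposes the outlier directly. This is why tail percentiles are often preferred when evaluating latency targets.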

Known Issues

There are no known issues in this release.