Triton Inference Server Release 21.02
The Triton Inference Server container image, release 21.02, is available on NGC and is open source on GitHub.
Contents of the Triton Inference Server container
The Triton Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tritonserver.
- Ubuntu 20.04 including Python 3.8
- NVIDIA CUDA 11.2.0 including cuBLAS 11.3.1
- NVIDIA cuDNN 8.0.5
- NVIDIA NCCL 2.8.4 (optimized for NVLink™)
- MLNX_OFED 5.1
- OpenMPI 4.0.5
- TensorRT 7.2.2
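As an illustrative sketch (not part of the official release notes), the following Python snippet uses the tritonclient HTTP API to confirm that a running 21.02 container is live and ready and to print its server metadata. It assumes the container was started with the default HTTP port 8000 published and that the tritonclient[http] Python package is installed on the client side.

```python
# Minimal sketch: verify a running Triton 21.02 container and print its metadata.
# Assumes the server was started with the default HTTP port published, e.g.
#   docker run --gpus=1 -p 8000:8000 -v /path/to/models:/models \
#       nvcr.io/nvidia/tritonserver:21.02-py3 tritonserver --model-repository=/models
# and that "pip install tritonclient[http]" has been run on the client side.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

print("live: ", client.is_server_live())    # True once the server process is up
print("ready:", client.is_server_ready())   # True once all models are loaded

metadata = client.get_server_metadata()     # reports server name, version, extensions
print(metadata)
```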
Driver Requirements
Release 21.02 is based on NVIDIA CUDA 11.2.0, which requires NVIDIA Driver release 460.27.04 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), or 450.51 (or later R450). The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.
GPU Requirements
Release 21.02 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the NVIDIA Pascal, Volta, Turing, and Ampere architecture families. For the list of GPUs to which this compute capability corresponds, see CUDA GPUs. For additional support details, see the Deep Learning Frameworks Support Matrix.
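As a rough sketch (an addition for illustration, not from the original notes), the driver and compute-capability requirements above can be checked from Python with the pynvml NVML bindings, assuming pynvml is installed on the host:

```python
# Minimal sketch: report the installed driver version and check that each GPU
# has compute capability 6.0 or higher (Pascal or newer), as required by 21.02.
# Assumes the NVML Python bindings are available ("pip install pynvml").
import pynvml

pynvml.nvmlInit()
try:
    driver = pynvml.nvmlSystemGetDriverVersion()
    if isinstance(driver, bytes):          # older pynvml releases return bytes
        driver = driver.decode()
    print("driver version:", driver)

    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):
            name = name.decode()
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        ok = (major, minor) >= (6, 0)
        print(f"GPU {i}: {name}, compute capability {major}.{minor}, "
              f"{'supported' if ok else 'NOT supported'}")
finally:
    pynvml.nvmlShutdown()
```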
Key Features and Enhancements
This Inference Server release includes the following key features and enhancements.
- Refer to the 21.02 column of the Frameworks Support Matrix for container image versions that the 21.02 inference server container is based on.
- Fixed a bug in the TensorRT backend that could, in rare cases, lead to corruption of output tensors.
- Fixed a performance issue in the HTTP/REST client that occurred when the client did not explicitly request specific outputs. In this case all outputs are now returned as binary data, where previously they were returned as JSON (see the Python sketch after this list).
- Added an example Java and Scala client based on the GRPC-generated API.
- Extended perf_analyzer to work with TFServing and TorchServe.
- The legacy custom backend API is deprecated and will be removed in a future release. The Triton Backend API should be used as the API for custom backends. The Triton Backend API remains fully supported, and that support will continue indefinitely.
- Model Analyzer parameters and test model configurations can be specified with a JSON configuration file.
- Model Analyzer now reports performance metrics for end-to-end latency and CPU memory usage.
- Ubuntu 20.04 with January 2021 updates.
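The following hedged Python sketch illustrates the HTTP/REST client behavior noted above. The model name "simple_model", input "INPUT0", output "OUTPUT0", shape, and datatype are hypothetical placeholders, not taken from the release notes. No outputs are explicitly requested, so with 21.02 all outputs are returned as binary data and can be read back as NumPy arrays:

```python
# Illustrative sketch only: "simple_model", "INPUT0", "OUTPUT0", the shape, and
# the datatype are placeholders. The point is that no outputs are explicitly
# requested, and as of 21.02 the HTTP/REST client still receives the results as
# binary data rather than JSON.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a single input tensor; binary_data=True sends the payload as binary.
data = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data, binary_data=True)

# No "outputs" argument: all model outputs are returned, now as binary data.
result = client.infer(model_name="simple_model", inputs=[infer_input])

# Read a returned tensor back as a NumPy array.
print(result.as_numpy("OUTPUT0"))
```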
NVIDIA Triton Inference Server Container Versions
The following table shows what versions of Ubuntu, CUDA, Triton Inference Server, and TensorRT are supported in each of the NVIDIA containers for Triton Inference Server. For older container versions, refer to the Frameworks Support Matrix.
Container Version | Triton Inference Server | Ubuntu | CUDA Toolkit | TensorRT |
---|---|---|---|---|
21.02 | 2.7.0 | 20.04 | NVIDIA CUDA 11.2.0 | TensorRT 7.2.2.3+cuda11.1.0.024 |
20.12 | 2.6.0 | 20.04 | NVIDIA CUDA 11.1.1 | TensorRT 7.2.2 |
20.11 | 2.5.0 | 18.04 | NVIDIA CUDA 11.1.0 | TensorRT 7.2.1 |
20.10 | 2.4.0 | 18.04 | NVIDIA CUDA 11.1.0 | TensorRT 7.2.1 |
20.09 | 2.3.0 | 18.04 | NVIDIA CUDA 11.0.3 | TensorRT 7.1.3 |
20.08 | 2.2.0 | 18.04 | NVIDIA CUDA 11.0.3 | TensorRT 7.1.3 |
20.07 | 1.15.0 | 18.04 | NVIDIA CUDA 11.0.194 | TensorRT 7.1.3 |
20.06 | 1.14.0 | 18.04 | NVIDIA CUDA 11.0.167 | TensorRT 7.1.2 |
20.03.1 | 1.13.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0 |
20.03 | 1.12.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0 |
20.02 | 1.11.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0 |
20.01 | 1.10.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0 |
19.12 | 1.9.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 6.0.1 |
19.11 | 1.8.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 6.0.1 |
19.10 | 1.7.0 | 18.04 | NVIDIA CUDA 10.1.243 | TensorRT 6.0.1 |
19.09 | 1.6.0 | 18.04 | NVIDIA CUDA 10.1.243 | TensorRT 6.0.1 |
19.08 | 1.5.0 | 18.04 | NVIDIA CUDA 10.1.243 | TensorRT 5.1.5 |
Known Issues
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use GKE version 1.13 or earlier, or 1.14.6 or later, to avoid this issue.
- A memory leak has been observed in the gRPC client library. Suggested workaround: restart the service to free memory, or run within Kubernetes with a failover mechanism. For more details on the gRPC issue, see https://github.com/triton-inference-server/server/issues/2517. The memory leak is fixed on the master branch by https://github.com/triton-inference-server/server/pull/2533, and the fix will be included in the 21.03 release. If required, the change can be applied to the 21.02 branch and the client library can be rebuilt: https://github.com/triton-inference-server/server/blob/master/docs/client_libraries.md.