Release Notes :: NVIDIA Deep Learning Triton Inference Server Documentation

The Triton Inference Server container image, release 21.03, is available on NGC and is open source on GitHub.

Contents of the Triton Inference Server container

The Triton Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tritonserver.

The container also includes the following:

Ubuntu 20.04 including Python 3.8
NVIDIA CUDA 11.2.1 including cuBLAS 11.4.1.1026
NVIDIA cuDNN 8.1.1
NVIDIA NCCL 2.8.4 (optimized for NVLink™)
MLNX_OFED 5.1
OpenMPI 4.0.5
TensorRT 7.2.2.3

Driver Requirements

Release 21.03 is based on NVIDIA CUDA 11.2.1, which requires NVIDIA Driver release 460.32.03 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, a T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), or 450.51 (or later R450). The CUDA driver's compatibility package supports only particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.
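The driver rule above can be expressed as a small check. The following is a minimal sketch in Python, not an NVIDIA tool: the function and constant names are hypothetical, and the version numbers are copied from the paragraph above. It treats any driver at or above 460.32.03 as sufficient, and otherwise accepts only the listed R418, R440, and R450 branches on Data Center GPUs.

```python
# Minimum driver versions taken from the Driver Requirements paragraph.
# All names here are illustrative; this is not part of any NVIDIA API.
DEFAULT_MINIMUM = (460, 32, 3)
DATA_CENTER_BRANCH_MINIMUMS = {
    418: (418, 40),
    440: (440, 33),
    450: (450, 51),
}


def parse_version(text):
    """Turn a version string such as '460.32.03' into (460, 32, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_sufficient(driver_version, data_center_gpu=False):
    """Return True if the driver satisfies the release 21.03 requirement."""
    version = parse_version(driver_version)
    if version >= DEFAULT_MINIMUM:
        return True
    if data_center_gpu:
        # Only the R418, R440, and R450 branches are listed as alternatives.
        minimum = DATA_CENTER_BRANCH_MINIMUMS.get(version[0])
        if minimum is not None:
            return version >= minimum
    return False
```

For example, `driver_is_sufficient("450.80.02", data_center_gpu=True)` passes, while the same driver on a non-Data Center GPU does not. The sketch does not consult the full CUDA Application Compatibility list, which remains the authoritative source.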

GPU Requirements

Release 21.03 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the NVIDIA Pascal, Volta, Turing, and Ampere Architecture GPU families. For a list of GPUs and their compute capabilities, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
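As a rough illustration of the compute capability requirement, the comparison can be sketched as follows. The helper name is hypothetical, and the input is assumed to be a "major.minor" string such as the values listed on the CUDA GPUs page.

```python
# Minimum compute capability stated for release 21.03.
MINIMUM_COMPUTE_CAPABILITY = (6, 0)


def gpu_is_supported(compute_capability):
    """Check a 'major.minor' compute capability string against the minimum."""
    major, minor = (int(part) for part in compute_capability.split("."))
    # Tuples compare element by element, so (7, 5) >= (6, 0) is True.
    return (major, minor) >= MINIMUM_COMPUTE_CAPABILITY
```

A T4 (compute capability 7.5) satisfies the check; a GPU with compute capability 5.2 does not.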

Key Features and Enhancements

This Inference Server release includes the following key features and enhancements.

Repository agent is a new extensibility C API added to Triton that allows implementation of custom authentication, decryption, conversion, or similar operations when a model is loaded. See https://github.com/triton-inference-server/server/blob/master/docs/repository_agents.md.
An OpenVINO backend is added to Triton to enable the execution of OpenVINO models on CPUs. See https://github.com/triton-inference-server/openvino_backend.
The PyTorch backend is now maintained in its own repository: https://github.com/triton-inference-server/pytorch_backend
The ONNX Runtime backend is now maintained in its own repository: https://github.com/triton-inference-server/onnxruntime_backend
The Jetson release of Triton now supports the shared-memory protocol between clients and the Triton server.
SSL/TLS Mutual Authentication support is added to the GRPC client library.
A new Model Configuration option, "gather_kernel_buffer_threshold", can be specified to instruct Triton to use a CUDA kernel to gather input buffers onto the GPU. Using this option can improve inference performance for some models.
The Python client libraries have been improved to more efficiently create numpy arrays for input and output tensors.
The client libraries examples have been improved to more clearly describe how string and byte-blob tensors are supported by the Python Client API. See https://github.com/triton-inference-server/server/blob/master/docs/client_examples.md.
Ubuntu 20.04 with February 2021 updates.
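To illustrate the "gather_kernel_buffer_threshold" option mentioned in the list above, the sketch below renders a fragment that could be appended to a model's config.pbtxt. It rests on two assumptions that should be checked against model_config.proto before use: that the option belongs inside the `optimization` block, and that the placeholder value used in the example is meaningful for your model. The function name is hypothetical.

```python
def render_gather_kernel_fragment(threshold):
    """Render a config.pbtxt fragment setting gather_kernel_buffer_threshold.

    Assumption: the field is placed in the model's `optimization` block.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    return (
        "optimization {\n"
        f"  gather_kernel_buffer_threshold: {threshold}\n"
        "}\n"
    )


# The value 4 is an arbitrary placeholder, not a recommended setting.
fragment = render_gather_kernel_fragment(4)
```

Because the release notes say only that the option can help some models, measure inference performance with and without it.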

NVIDIA Triton Inference Server Container Versions

The following table shows what versions of Ubuntu, CUDA, Triton Inference Server, and TensorRT are supported in each of the NVIDIA containers for Triton Inference Server. For older container versions, refer to the Frameworks Support Matrix.

Container Version	Triton Inference Server	Ubuntu	CUDA Toolkit	TensorRT
21.03	2.8.0	20.04	NVIDIA CUDA 11.2.1	TensorRT 7.2.2.3
21.02	2.7.0		NVIDIA CUDA 11.2.0	TensorRT 7.2.2.3+cuda11.1.0.024
20.12	2.6.0		NVIDIA CUDA 11.1.1	TensorRT 7.2.2
20.11	2.5.0	18.04	NVIDIA CUDA 11.1.0	TensorRT 7.2.1
20.10	2.4.0		NVIDIA CUDA 11.1.0	TensorRT 7.2.1
20.09	2.3.0		NVIDIA CUDA 11.0.3	TensorRT 7.1.3
20.08	2.2.0		NVIDIA CUDA 11.0.3
20.07	1.15.0 2.1.0		NVIDIA CUDA 11.0.194
20.06	1.14.0 2.0.0		NVIDIA CUDA 11.0.167	TensorRT 7.1.2
20.03.1	1.13.0		NVIDIA CUDA 10.2.89	TensorRT 7.0.0
20.03	1.12.0
20.02	1.11.0
20.01	1.10.0
19.12	1.9.0			TensorRT 6.0.1
19.11	1.8.0
19.10	1.7.0		NVIDIA CUDA 10.1.243
19.09	1.6.0
19.08	1.5.0			TensorRT 5.1.5

Known Issues

Triton’s TensorRT support depends on the input-consumed feature of TensorRT. In some rare cases using TensorRT 8.0 and earlier versions, the input-consumed event fires earlier than expected, causing Triton to overwrite input tensors while they are still in use and leading to corrupt input data being used for inference. This situation occurs when the inputs feed directly into a TensorRT layer that is optimized into a ForeignNode in the builder log. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model’s configuration (https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto#L816).
There are backwards-incompatible changes in the example Python client shared-memory support library when that library is used for tensors of type BYTES. The utils.serialize_byte_tensor() and utils.deserialize_byte_tensor() functions now return np.object_ numpy arrays where previously they returned np.bytes_ numpy arrays. Code depending on np.bytes_ must be updated. This change was necessary because the np.bytes_ type removes all trailing zeros from each array element, so binary sequences ending in zero(s) could not be represented with the old behavior. Correct usage of the Python client shared-memory support library is shown in https://github.com/triton-inference-server/server/blob/r21.03/src/clients/python/examples/simple_http_shm_string_client.py.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
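The reason for the BYTES tensor change noted above (np.bytes_ replaced by np.object_) can be reproduced with plain numpy. The sketch below does not call the Triton client utilities. It only shows the underlying numpy behavior: a np.bytes_ array loses trailing zero bytes on read-back, while a np.object_ array keeps the element intact. The sample payload is arbitrary.

```python
import numpy as np

# A binary sequence that ends in zero bytes.
payload = b"ab\x00\x00"

# Fixed-width bytes dtype: trailing zero bytes are dropped when the
# element is read back from the array.
as_bytes = np.array([payload], dtype=np.bytes_)
bytes_roundtrip = bytes(as_bytes[0])

# Object dtype: the array stores the Python bytes object unchanged.
as_object = np.array([payload], dtype=np.object_)
object_roundtrip = as_object[0]

print(len(bytes_roundtrip), len(object_roundtrip))
```

Code that previously relied on np.bytes_ results would silently see the shorter value, which is why it must be updated to handle np.object_ arrays.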