Release Notes :: NVIDIA Deep Learning Triton Inference Server Documentation

The Triton Inference Server container image, release 20.10, is available on NGC and is open source on GitHub.

Contents of the Triton Inference Server container

The Triton Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tritonserver.

The container also includes the following:

Ubuntu 18.04 including Python 3.6
NVIDIA CUDA 11.1.0 including cuBLAS 11.2.1
NVIDIA cuDNN 8.0.4
NVIDIA NCCL 2.7.8 (optimized for NVLink™)
MLNX_OFED
OpenMPI 3.1.6
TensorRT 7.2.1

Driver Requirements

Release 20.10 is based on NVIDIA CUDA 11.1.0, which requires NVIDIA Driver release 455 or later. However, if you are running on a Tesla GPU (for example, a T4 or any other Tesla board), you may use NVIDIA driver release 418.xx, 440.30, or 450.xx. The CUDA driver's compatibility package supports only particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
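The driver rule above can be sketched as a small check. This is only an illustration under stated assumptions: the function name driver_ok and the is_tesla flag are hypothetical, and "440.30" is interpreted here as "440.30 or later within the 440 branch", which the text does not state explicitly. The CUDA Application Compatibility topic remains the authoritative list.

```python
# Minimal sketch of the 20.10 driver requirement described above.
# Names and the interpretation of "440.30" are assumptions, not NVIDIA tooling.

MIN_DRIVER_BRANCH = 455             # CUDA 11.1.0 requires driver release 455 or later
TESLA_COMPAT_BRANCHES = (418, 450)  # "418.xx" and "450.xx" are allowed on Tesla boards
TESLA_440_MIN_MINOR = 30            # assumption: "440.30" means 440.30 or later in branch 440


def parse_driver_version(version: str) -> tuple:
    """Turn a driver string such as '450.51.06' into a tuple of integers."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_ok(version: str, is_tesla: bool = False) -> bool:
    """Return True if the driver version satisfies the rule quoted above."""
    parts = parse_driver_version(version)
    major = parts[0]
    minor = parts[1] if len(parts) > 1 else 0

    if major >= MIN_DRIVER_BRANCH:
        return True          # new enough for any supported GPU
    if not is_tesla:
        return False         # older drivers are only accepted on Tesla boards
    if major in TESLA_COMPAT_BRANCHES:
        return True
    return major == 440 and minor >= TESLA_440_MIN_MINOR
```

A driver string can be obtained, for example, from the output of nvidia-smi and passed to driver_ok together with a flag indicating whether the board is a Tesla GPU.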

GPU Requirements

Release 20.10 supports CUDA compute capability 6.0 and higher, which corresponds to GPUs in the NVIDIA Pascal, Volta, Turing, and Ampere architecture families. For a list of GPUs and their compute capabilities, see CUDA GPUs. For additional support details, see the Deep Learning Frameworks Support Matrix.
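As a rough illustration of the compute capability requirement, the comparison can be written as a tuple check. The helper names below are hypothetical; the only value taken from the text is the 6.0 minimum.

```python
# Minimal sketch: compare a GPU's compute capability against the 20.10 minimum.
# Function names are illustrative assumptions.

MIN_COMPUTE_CAPABILITY = (6, 0)  # from the text: compute capability 6.0 and higher


def parse_compute_capability(text: str) -> tuple:
    """Turn a string such as '7.5' into (7, 5); a bare '7' becomes (7, 0)."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def gpu_supported(compute_capability: str) -> bool:
    """Return True if the compute capability meets the release minimum."""
    return parse_compute_capability(compute_capability) >= MIN_COMPUTE_CAPABILITY
```

Comparing (major, minor) tuples rather than floats avoids mis-ordering values such as 6.10 and 6.2 if a two-digit minor version ever appears.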

Key Features and Enhancements

This Inference Server release includes the following key features and enhancements.

A new Python backend allows Python code to run as a model within Triton. See https://github.com/triton-inference-server/python_backend.
A new DALI backend allows running pre-processing and augmentation pipelines within Triton. See https://github.com/triton-inference-server/dali_backend.
The perf_client application is renamed to perf_analyzer; functionality remains the same.
A new Model Analyzer project has been started with the goal of providing analysis and guidance on how best to optimize single or multiple models within Triton. The initial release analyzes GPU memory usage. See https://github.com/triton-inference-server/model_analyzer.
Triton documentation now resides on GitHub and is reachable from https://github.com/triton-inference-server/server/blob/master/README.md.
The build process for Triton has changed; see https://github.com/triton-inference-server/server/blob/master/docs/build.md.
Triton backends are moving to separate repositories. In this release, the TensorFlow, ONNX Runtime, Python, and DALI backends have been moved; see https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton.
Refer to the 20.10 column of the Frameworks Support Matrix for container image versions that the 20.10 inference server container is based on.
Ubuntu 18.04 with September 2020 updates.

NVIDIA Triton Inference Server Container Versions

The following table shows which versions of Ubuntu, CUDA, Triton Inference Server, and TensorRT are supported in each of the NVIDIA containers for Triton Inference Server. For older container versions, refer to the Frameworks Support Matrix.

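Container Version	Triton Inference Server	Ubuntu	CUDA Toolkit	TensorRT
20.10	2.4.0	18.04	NVIDIA CUDA 11.1.0	TensorRT 7.2.1
20.09	2.3.0	18.04	NVIDIA CUDA 11.0.3	TensorRT 7.1.3
20.08	2.2.0	18.04	NVIDIA CUDA 11.0.3	TensorRT 7.1.3
20.07	1.15.0 / 2.1.0	18.04	NVIDIA CUDA 11.0.194	TensorRT 7.1.3
20.06	1.14.0 / 2.0.0	18.04	NVIDIA CUDA 11.0.167	TensorRT 7.1.2
20.03.1	1.13.0	18.04	NVIDIA CUDA 10.2.89	TensorRT 7.0.0
20.03	1.12.0	18.04	NVIDIA CUDA 10.2.89	TensorRT 7.0.0
20.02	1.11.0	18.04	NVIDIA CUDA 10.2.89	TensorRT 7.0.0
20.01	1.10.0	18.04	NVIDIA CUDA 10.2.89	TensorRT 7.0.0
19.12	1.9.0	18.04	NVIDIA CUDA 10.2.89	TensorRT 6.0.1
19.11	1.8.0	18.04	NVIDIA CUDA 10.2.89	TensorRT 6.0.1
19.10	1.7.0	18.04	NVIDIA CUDA 10.1.243	TensorRT 6.0.1
19.09	1.6.0	18.04	NVIDIA CUDA 10.1.243	TensorRT 6.0.1
19.08	1.5.0	18.04	NVIDIA CUDA 10.1.243	TensorRT 5.1.5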

Known Issues

TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). To avoid this issue, use GKE 1.13 or earlier, or GKE 1.14.6 or later.
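
The GKE guidance above implies that versions from 1.14.0 up to, but not including, 1.14.6 are the ones to avoid. The following sketch encodes that inferred range; the function name is hypothetical, and the exact set of affected builds is an assumption derived from the recommendation rather than from a published list.

```python
# Minimal sketch: flag GKE versions inside the range implied by the known issue.
# The affected range (1.14.0 <= version < 1.14.6) is inferred from the guidance above.
import re

FIRST_AFFECTED = (1, 14, 0)  # assumption: first version after "1.13 or earlier"
FIRST_FIXED = (1, 14, 6)     # from the text: 1.14.6 or later avoids the issue


def gke_version_affected(version: str) -> bool:
    """Return True if a version such as '1.14.3-gke.11' falls in the affected range."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized GKE version: {version!r}")
    major = int(match.group(1))
    minor = int(match.group(2))
    patch = int(match.group(3)) if match.group(3) else 0  # a missing patch is treated as 0
    return FIRST_AFFECTED <= (major, minor, patch) < FIRST_FIXED
```

Any suffix after the numeric part, such as "-gke.11", is ignored by this check.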