Triton Inference Server Release 20.07
The Triton Inference Server container image, release 20.07, is available on NGC and is open source on GitHub.
Contents of the Triton Inference Server container
The Triton Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tritonserver. The container also includes the following:
- Ubuntu 18.04 including Python 3.6
- NVIDIA CUDA 11.0.194 including cuBLAS 11.1.0
- NVIDIA cuDNN 8.0.1
- NVIDIA NCCL 2.7.6 (optimized for NVLink™)
- MLNX_OFED
- OpenMPI 3.1.6
- TensorRT 7.1.3
Driver Requirements
Release 20.07 is based on NVIDIA CUDA 11.0.194, which requires NVIDIA Driver release 450 or later. However, if you are running on a Tesla GPU (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.30. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
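The driver rule above can be expressed as a small host-side check. This is a minimal sketch, not an official tool: the function names are hypothetical, and reading "440.30" as "440.30 or any later R440 driver" is an assumption. The `nvidia-smi --query-gpu=driver_version --format=csv,noheader` query is used only to read the installed version.

```python
import subprocess
from typing import List, Tuple


def parse_driver_version(version: str) -> Tuple[int, int]:
    """Split a driver string such as "450.51.05" into (branch, minor) = (450, 51)."""
    parts = version.strip().split(".")
    branch = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return branch, minor


def driver_supported(version: str, is_tesla: bool = False) -> bool:
    """Approximate check of a host driver against the release 20.07 requirements."""
    branch, minor = parse_driver_version(version)
    if branch >= 450:
        # CUDA 11.0.194 runs natively on R450 or later.
        return True
    if is_tesla:
        # Assumption: "418.xx or 440.30" is read as any R418 driver,
        # or an R440 driver at 440.30 or later.
        return branch == 418 or (branch == 440 and minor >= 30)
    return False


def installed_driver_versions() -> List[str]:
    """Ask nvidia-smi for the driver version reported for each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

On a Tesla host, something like `all(driver_supported(v, is_tesla=True) for v in installed_driver_versions())` gives a quick first answer; the CUDA Application Compatibility topic remains the authoritative list.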
GPU Requirements
Release 20.07 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the NVIDIA Pascal, Volta, Turing, and Ampere Architecture GPU families. For the list of GPUs and their compute capabilities, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
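The compute-capability requirement reduces to a tuple comparison. The sketch below assumes you have already looked up the GPU's compute capability (for example, on the CUDA GPUs page) as a "major.minor" string; the function names and the family table are illustrative, with each family listed at the compute capability of its first GPUs.

```python
from typing import Tuple

# Release 20.07 requires CUDA compute capability 6.0 or higher.
MIN_COMPUTE_CAPABILITY = (6, 0)

# Compute capability of the first GPUs in each family named above
# (illustrative; later chips in a family can report higher values).
FAMILY_FIRST_CC = {
    "Pascal": (6, 0),
    "Volta": (7, 0),
    "Turing": (7, 5),
    "Ampere": (8, 0),
}


def parse_compute_capability(cc: str) -> Tuple[int, int]:
    """Turn a string such as "7.5" into the tuple (7, 5)."""
    major, minor = cc.strip().split(".")
    return int(major), int(minor)


def gpu_supported(cc: str) -> bool:
    """Return True if a GPU with compute capability `cc` meets the 20.07 minimum."""
    # Tuples compare element by element, so (7, 5) >= (6, 0) is True.
    return parse_compute_capability(cc) >= MIN_COMPUTE_CAPABILITY
```

Comparing tuples rather than floats avoids surprises such as "7.10" being read as 7.1.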
Key Features and Enhancements
- Triton V2 adds a TensorFlow optimization option that enables automatic FP16 optimization of the model.
- For Triton V2, the PyTorch backend now includes support for TorchVision operations.
- This release supports both the new KFServing-based protocols and the legacy V1 protocols.
  - Support for the new KFServing HTTP/REST and GRPC protocols and the corresponding client libraries is released on GitHub branch r20.07 and as NGC container 20.07-py3.
  - Support for the legacy V1 HTTP/REST and GRPC protocols and the corresponding client libraries is released on GitHub branch r20.07-v1 and as NGC container 20.07-v1-py3.
  - Migrating from Triton V1 to Triton V2 requires significant changes; see the “Backwards Compatibility” and “Roadmap” sections of the GitHub README for more information.
- Refer to the 20.07 column of the Frameworks Support Matrix for container image versions that the 20.07 inference server container is based on.
- Ubuntu 18.04 with June 2020 updates.
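The automatic FP16 option for TensorFlow models is set in the model's configuration file. The sketch below renders a minimal config.pbtxt for a TensorFlow SavedModel; it assumes the option is exposed as a GPU execution accelerator named `auto_mixed_precision`, which should be confirmed against the model configuration documentation for the 20.07 (V2) release. The function name and model name are hypothetical, and inputs/outputs are deliberately left out.

```python
def tf_model_config(name: str, max_batch_size: int, auto_fp16: bool) -> str:
    """Render a minimal config.pbtxt for a TensorFlow SavedModel.

    Inputs and outputs are omitted here; this assumes they are either
    added by hand or filled in automatically by the server.
    """
    lines = [
        'name: "{}"'.format(name),
        'platform: "tensorflow_savedmodel"',
        "max_batch_size: {}".format(max_batch_size),
    ]
    if auto_fp16:
        # Assumed accelerator name for TensorFlow automatic mixed precision (FP16).
        lines += [
            "optimization {",
            "  execution_accelerators {",
            '    gpu_execution_accelerator : [ { name : "auto_mixed_precision" } ]',
            "  }",
            "}",
        ]
    return "\n".join(lines) + "\n"
```

The returned text would be written to config.pbtxt inside the model's directory in the model repository, next to its numbered version subdirectories.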
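To illustrate the difference between the two protocol families, the sketch below builds (without sending) an HTTP/REST inference request in the KFServing v2 style served by the 20.07-py3 container; the legacy V1 protocol in 20.07-v1-py3 uses a different format. It assumes the server listens on localhost:8000 and serves a hypothetical FP32 model `my_model` with an input named `INPUT0`. Because the KFServing protocols are beta in this release, the payload shape may change.

```python
import json
import urllib.request
from typing import List


def build_infer_request(base_url: str, model: str, input_name: str,
                        shape: List[int], data: List[float]) -> urllib.request.Request:
    """Build, without sending, a KFServing v2 HTTP/REST inference request."""
    # The flattened data must contain exactly prod(shape) elements.
    expected = 1
    for dim in shape:
        expected *= dim
    if expected != len(data):
        raise ValueError("shape {} needs {} values, got {}".format(shape, expected, len(data)))

    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": shape,
                "datatype": "FP32",
                "data": data,
            }
        ]
    }
    return urllib.request.Request(
        url="{}/v2/models/{}/infer".format(base_url.rstrip("/"), model),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Against a running server, `urllib.request.urlopen(req)` would send the request; in practice the Python client library from the r20.07 branch wraps this protocol for you.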
NVIDIA Triton Inference Server Container Versions
The following table shows what versions of Ubuntu, CUDA, Triton Inference Server, and TensorRT are supported in each of the NVIDIA containers for Triton Inference Server. For older container versions, refer to the Frameworks Support Matrix.
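| Container Version | Triton Inference Server | Ubuntu | CUDA Toolkit | TensorRT |
|---|---|---|---|---|
| 20.07 | 1.15.0 | 18.04 | NVIDIA CUDA 11.0.194 | TensorRT 7.1.3 |
| 20.06 | 1.14.0 | 18.04 | NVIDIA CUDA 11.0.167 | TensorRT 7.1.2 |
| 20.03.1 | 1.13.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0 |
| 20.03 | 1.12.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0 |
| | 1.11.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0 |
| | 1.10.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0 |
| | 1.9.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 6.0.1 |
| | 1.8.0 | 18.04 | NVIDIA CUDA 10.2.89 | TensorRT 6.0.1 |
| 19.10 | 1.7.0 | 18.04 | NVIDIA CUDA 10.1.243 | TensorRT 6.0.1 |
| 19.09 | 1.6.0 | 18.04 | NVIDIA CUDA 10.1.243 | TensorRT 6.0.1 |
| 19.08 | 1.5.0 | 18.04 | NVIDIA CUDA 10.1.243 | TensorRT 5.1.5 |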
Known Issues
- When using the TensorRT NGC container to generate TensorRT models for Triton, the 20.07.1 version of the TensorRT container must be used to ensure compatibility with Triton 20.07.
- The KFServing HTTP/REST and GRPC protocols and corresponding Python and C++ clients are beta quality and are likely to change.
- The new C API specified in tritonserver.h is beta quality and is likely to change.
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). To avoid this issue, use GKE 1.13 or earlier, or GKE 1.14.6 or later.
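The GKE guidance above leaves only a narrow window of affected versions. This sketch reads "1.13 or earlier, or 1.14.6 or later" as meaning that only 1.14.0 through 1.14.5 are affected, and assumes version strings shaped like "1.14.3-gke.11"; both readings are assumptions, and the helper names are hypothetical.

```python
import re
from typing import Tuple


def parse_gke_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as "1.14.3-gke.11"."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError("unrecognized GKE version: {!r}".format(version))
    major, minor, patch = (int(group) for group in match.groups())
    return major, minor, patch


def hits_ld_library_path_regression(version: str) -> bool:
    """True for 1.14.0 through 1.14.5, the range the known issue advises avoiding."""
    return (1, 14, 0) <= parse_gke_version(version) < (1, 14, 6)
```

Only the numeric prefix is compared; the "-gke.N" suffix is ignored, which is sufficient for the boundaries stated in the known issue.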