Triton Inference Server Release 20.03
Contents of the Triton Inference Server container
The Triton Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tritonserver. The container also includes the following:
- Ubuntu 18.04 including Python 3.6
- NVIDIA CUDA 10.2.89 including cuBLAS 10.2.2.89
- NVIDIA cuDNN 7.6.5
- NVIDIA NCCL 2.6.3 (optimized for NVLink™)
- MLNX_OFED
- OpenMPI 3.1.4
- TensorRT 7.0.0
Driver Requirements
Release 20.03 is based on NVIDIA CUDA 10.2.89, which requires NVIDIA driver release 440.33.01. However, if you are running on Tesla GPUs (for example, T4 or any other Tesla board), you may use NVIDIA driver release 396, 384.111+, 410, 418.xx, or 440.30. The CUDA driver's compatibility package supports only particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
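The driver rule above can be sketched as a small check. This is a minimal illustration, not an NVIDIA tool: the function names are hypothetical, and it assumes that "requires 440.33.01" means 440.33.01 or later and that "440.30" means 440.30 or later on Tesla boards. The CUDA Application Compatibility topic remains the authoritative list.

```python
def parse_version(text):
    """Turn a dotted driver version such as '440.33.01' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Minimum driver stated for CUDA 10.2.89 in these release notes.
DEFAULT_MINIMUM = (440, 33, 1)


def driver_is_supported(version, tesla=False):
    """Return True if the driver satisfies the rule described above.

    Assumption: versions at or above 440.33.01 are acceptable everywhere;
    the older branches are accepted only when tesla=True.
    """
    v = parse_version(version)
    if v >= DEFAULT_MINIMUM:
        return True
    if not tesla:
        return False
    branch = v[0]
    if branch in (396, 410, 418):
        return True
    if branch == 384:
        # "384.111+" is read as: minor version 111 or higher.
        return v[1:2] >= (111,)
    if branch == 440:
        # Assumption: "440.30" is read as 440.30 or later.
        return v[:2] >= (440, 30)
    return False
```

For example, `driver_is_supported("418.87.00")` is false on a non-Tesla GPU, while `driver_is_supported("418.87.00", tesla=True)` is true.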
GPU Requirements
Release 20.03 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the Pascal, Volta, and Turing families. For a list of GPUs that meet this compute capability, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
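As a minimal sketch of the requirement above, the following compares a GPU's compute capability against the 6.0 minimum. The function names are hypothetical, and the capability string is assumed to come from your own inventory or from the CUDA GPUs page.

```python
# Minimum compute capability supported by release 20.03.
MIN_COMPUTE_CAPABILITY = (6, 0)


def parse_compute_capability(text):
    """Parse a 'major.minor' string such as '7.5' into a comparable tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def is_supported_gpu(capability, minimum=MIN_COMPUTE_CAPABILITY):
    """Return True if the capability is at or above the minimum."""
    return parse_compute_capability(capability) >= minimum
```

Comparing tuples rather than floats avoids ordering mistakes with two-digit minor versions. For example, `is_supported_gpu("7.5")` is true and `is_supported_gpu("5.2")` is false.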
Key Features and Enhancements
- Added queuing policies for the dynamic batching scheduler. These policies are specified in the model configuration and allow each model to set maximum queue size, timeouts, and priority levels for inference requests.
- Support for large ONNX models where weights are stored in separate files.
- Allow ONNX Runtime optimization level to be configured via the model configuration optimization setting.
- Experimental Python client and server support for the community-standard gRPC inferencing API.
- Added --min-supported-compute-capability flag to allow Triton Server to use older, unsupported GPUs.
- Fixed perf_client shared memory support. In some cases the shared-memory option did not work correctly due to the input and output tensor names. This issue is now resolved.
- Refer to the Frameworks Support Matrix for container image versions that the 20.03 inference server container is based on.
- The inference server container image version 20.03 is additionally based on ONNX Runtime 1.1.1.
- Ubuntu 18.04 with February 2020 updates
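To illustrate the queuing policies mentioned in the first item above, the following sketch renders a `dynamic_batching` fragment for a model configuration file. The helper name and its parameters are hypothetical. The field names (`priority_levels`, `default_priority_level`, `default_queue_policy`, `timeout_action`, `default_timeout_microseconds`, `max_queue_size`) reflect my understanding of the model configuration schema and should be checked against the model configuration documentation for this release.

```python
def render_dynamic_batching(max_queue_size, timeout_us,
                            priority_levels=1, default_priority=1):
    """Render a dynamic_batching fragment as model-configuration text.

    Hypothetical helper for illustration; field names are assumptions
    to verify against the model configuration schema.
    """
    if not 1 <= default_priority <= priority_levels:
        raise ValueError("default_priority must be between 1 and priority_levels")
    lines = [
        "dynamic_batching {",
        f"  priority_levels: {priority_levels}",
        f"  default_priority_level: {default_priority}",
        "  default_queue_policy {",
        "    timeout_action: REJECT",
        f"    default_timeout_microseconds: {timeout_us}",
        f"    max_queue_size: {max_queue_size}",
        "  }",
        "}",
    ]
    return "\n".join(lines)


# Example: two priority levels, a 5 ms timeout, at most 16 queued requests.
print(render_dynamic_batching(max_queue_size=16, timeout_us=5000,
                              priority_levels=2, default_priority=1))
```

The printed fragment would be placed in the model's configuration file alongside its other settings.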
NVIDIA Triton Inference Server Container Versions
The following table shows which versions of Ubuntu, CUDA, Triton Inference Server, and TensorRT are supported in each of the NVIDIA containers for Triton Inference Server. For older container versions, refer to the Frameworks Support Matrix.
Container Version | Triton Inference Server | Ubuntu | CUDA Toolkit | TensorRT |
---|---|---|---|---|
20.03 | 1.12.0 | 18.04 16.04 | NVIDIA CUDA 10.2.89 | TensorRT 7.0.0 |
20.02 | 1.11.0 | | | |
20.01 | 1.10.0 | | | |
19.12 | 1.9.0 | | | TensorRT 6.0.1 |
19.11 | 1.8.0 | | | |
19.10 | 1.7.0 | | NVIDIA CUDA 10.1.243 | |
19.09 | 1.6.0 | | | |
19.08 | 1.5.0 | | | TensorRT 5.1.5 |
An empty cell means that the value from the nearest filled cell above it in the same column applies.
Known Issues
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). To avoid this issue, use GKE 1.13 or earlier, or GKE 1.14.6 or later.
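The GKE guidance above can be expressed as a version check. This is a minimal sketch with hypothetical function names. It assumes the affected range is every version after the 1.13 series and before 1.14.6, and that version strings look like `1.14.3-gke.11`.

```python
def parse_gke_version(text):
    """Parse a string like '1.14.6-gke.1' into (major, minor, patch)."""
    core = text.strip().lstrip("v").split("-")[0]
    parts = [int(p) for p in core.split(".")]
    while len(parts) < 3:
        parts.append(0)  # treat a missing patch level as 0
    return tuple(parts[:3])


def gke_version_is_affected(text):
    """Return True if the version falls in the range the note says to avoid.

    Assumption: affected means newer than the 1.13 series and older
    than 1.14.6.
    """
    version = parse_gke_version(text)
    return version[:2] > (1, 13) and version < (1, 14, 6)
```

For example, `gke_version_is_affected("1.14.3-gke.11")` is true, while `1.13.11-gke.14` and `1.14.6-gke.1` are both outside the affected range.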