Triton Inference Server Release 20.02

The TensorRT Inference Server container image, release 20.02, is available on NGC and is open source on GitHub.

Contents of the Triton Inference Server Container

The TensorRT Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tensorrtserver.

Driver Requirements

Release 20.02 is based on NVIDIA CUDA 10.2.89, which requires NVIDIA Driver release 440.33.01. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 396, 384.111+, 410, 418.xx or 440.30. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
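
The driver rule above can be written as a small check. This is a minimal sketch, not an NVIDIA tool: the names `parse_version` and `driver_is_supported` are hypothetical, and reading "384.111+" and "440.30" as minimum versions within their driver branches is an assumption about how the list is meant.

```python
# Minimum driver release required by CUDA 10.2.89, as stated in the text.
MIN_DRIVER = (440, 33, 1)

# Driver branches the text lists as usable on Tesla boards without a
# minimum minor version (396, 410, 418.xx).
TESLA_COMPAT_BRANCHES = {396, 410, 418}


def parse_version(text):
    """Turn a dotted driver version such as '440.33.01' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_is_supported(version, is_tesla=False):
    """Return True if the driver version satisfies the rule described above."""
    parsed = parse_version(version)
    if parsed >= MIN_DRIVER:
        return True
    if not is_tesla:
        return False
    branch = parsed[0]
    if branch in TESLA_COMPAT_BRANCHES:
        return True
    if branch == 384:
        # Assumption: "384.111+" means 384.111 or later.
        return parsed >= (384, 111)
    if branch == 440:
        # Assumption: "440.30" means 440.30 or later.
        return parsed >= (440, 30)
    return False
```

For example, `driver_is_supported("418.87")` is false on a non-Tesla GPU, while `driver_is_supported("418.87", is_tesla=True)` is true. The CUDA Application Compatibility topic remains the authoritative list.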

GPU Requirements

Release 20.02 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the Pascal, Volta, and Turing families. For a list of GPUs with these compute capabilities, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
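
As an illustration of the requirement above, a compute capability can be compared as a (major, minor) pair. This is a minimal sketch: the function names are hypothetical, and how the capability of a given GPU is obtained is outside its scope.

```python
# Minimum compute capability supported by release 20.02, as stated in the text.
MIN_COMPUTE_CAPABILITY = (6, 0)


def is_supported(major, minor):
    """Return True if the compute capability is 6.0 or higher."""
    return (major, minor) >= MIN_COMPUTE_CAPABILITY


def architecture_family(major, minor):
    """Map a compute capability to one of the families named in the text.

    Returns None for capabilities outside Pascal, Volta, and Turing.
    """
    if major == 6:
        return "Pascal"
    if major == 7:
        # 7.5 is Turing; the other 7.x capabilities are Volta.
        return "Turing" if minor == 5 else "Volta"
    return None
```

A T4, which has compute capability 7.5, passes `is_supported(7, 5)` and maps to "Turing".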

Key Features and Enhancements

This Inference Server release includes the following key features and enhancements.
  • Refer to the Frameworks Support Matrix for container image versions that the 20.02 inference server container is based on.
  • The inference server container image version 20.02 is additionally based on ONNX Runtime 1.1.1.
  • The TensorRT backend is improved to have significantly better performance. Improvements include reducing thread contention, using pinned memory for faster CPU<->GPU transfers, and increasing compute and memory copy overlap on GPUs.
  • Memory usage of TensorRT models is reduced in many cases by sharing weights across multiple model instances.
  • Boolean data-type and shape tensors are now supported for TensorRT models.
  • A new model configuration option allows the dynamic batcher to create “ragged” batches for custom backend models. A ragged batch is a batch where one or more of the input/output tensors have different shapes in different batch entries.
  • Local S3 storage endpoints are now supported for model repositories. A local S3 endpoint is specified as s3://host:port/path/to/repository.
  • The Helm chart showing an example Kubernetes deployment is updated to include Prometheus and Grafana support so that inference server metrics can be collected and visualized.
  • The inference server container no longer sets LD_LIBRARY_PATH; instead, the server uses RUNPATH to locate its shared libraries.
  • Python 2 has reached end-of-life, so all support for it has been removed. Python 3 is still supported.
  • The container image is based on Ubuntu 18.04 with January 2020 updates.
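
The definition of a ragged batch given above can be made concrete with shapes alone. This is a minimal sketch of the definition for a single tensor; `is_ragged` is a hypothetical helper and is not part of the inference server or its model configuration.

```python
def is_ragged(entry_shapes):
    """Return True if one tensor has different shapes in different batch entries.

    entry_shapes holds one shape tuple per batch entry for the same tensor.
    """
    return len(set(entry_shapes)) > 1


# Every batch entry supplies the tensor with the same shape: a regular batch.
uniform_batch = [(3, 224, 224), (3, 224, 224), (3, 224, 224)]

# The entries supply the tensor with different lengths: a ragged batch.
ragged_batch = [(16,), (7,), (16,)]
```

Here `is_ragged(uniform_batch)` is false and `is_ragged(ragged_batch)` is true.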
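
The local S3 endpoint form `s3://host:port/path/to/repository` mentioned above splits into its parts with the Python standard library. This is a minimal sketch and does not show the server's own parsing: `parse_s3_endpoint` is a hypothetical helper, the host and port in the usage note are made up, and rejecting URLs without an explicit port is an assumption made here to tell a local endpoint apart from other S3 paths.

```python
from urllib.parse import urlsplit


def parse_s3_endpoint(url):
    """Split a local S3 repository URL into (host, port, repository_path)."""
    parts = urlsplit(url)
    if parts.scheme != "s3":
        raise ValueError("expected an s3:// URL, got: " + url)
    if parts.hostname is None or parts.port is None:
        # Assumption: a local endpoint always names both a host and a port.
        raise ValueError("expected s3://host:port/path, got: " + url)
    return parts.hostname, parts.port, parts.path
```

For example, `parse_s3_endpoint("s3://localhost:9000/models/repository")` yields the host `localhost`, the port `9000`, and the path `/models/repository`.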

NVIDIA TensorRT Inference Server Container Versions

The following table shows which versions of Ubuntu, CUDA, TensorRT Inference Server, and TensorRT are supported in each of the NVIDIA containers for TensorRT Inference Server. For older container versions, refer to the Frameworks Support Matrix.

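Container Version   Ubuntu   CUDA Toolkit           TensorRT Inference Server   TensorRT
20.02               18.04    NVIDIA CUDA 10.2.89    1.11.0                      TensorRT 7.0.0
20.01               18.04    NVIDIA CUDA 10.2.89    1.10.0                      TensorRT 7.0.0
19.12               18.04    NVIDIA CUDA 10.2.89    1.9.0                       TensorRT 6.0.1
19.11               18.04    NVIDIA CUDA 10.2.89    1.8.0                       TensorRT 6.0.1
19.10               18.04    NVIDIA CUDA 10.1.243   1.7.0                       TensorRT 6.0.1
19.09               18.04    NVIDIA CUDA 10.1.243   1.6.0                       TensorRT 6.0.1
19.08               18.04    NVIDIA CUDA 10.1.243   1.5.0                       TensorRT 5.1.5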

Known Issues

  • TensorRT reformat-free I/O is not supported.

  • Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.