Triton Inference Server Release 21.09

The Triton Inference Server container image, release 21.09, is available on NGC and is open source on GitHub.

Contents of the Triton Inference Server container

The Triton Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tritonserver.

The container also includes the following:

Driver Requirements

Release 21.09 is based on NVIDIA CUDA 11.4.2, which requires NVIDIA Driver release 470 or later. However, if you are running on a Data Center GPU (formerly Tesla), such as the T4, you can use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), or 460.27 (or later R460). The CUDA driver's compatibility package supports only particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.
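The driver rule above can be expressed as a small check. The following Python sketch is illustrative only: the function names are hypothetical, the version thresholds are copied from the paragraph above, and it assumes that a driver branch not named there (for example R465) does not qualify. The CUDA Application Compatibility topic remains the authoritative list.

```python
# Thresholds copied from the Driver Requirements paragraph for release 21.09.
DEFAULT_MIN_BRANCH = 470
# Data Center GPU exceptions: driver branch -> minimum (major, minor) in that branch.
DATA_CENTER_MINIMUMS = {
    418: (418, 40),
    440: (440, 33),
    450: (450, 51),
    460: (460, 27),
}


def parse_driver_version(version):
    """Turn a string such as '450.51.06' into a tuple of ints: (450, 51, 6)."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_is_listed(version, data_center_gpu=False):
    """Return True if the driver version matches the rule stated for 21.09.

    Hypothetical helper: it only encodes the text above and assumes that
    branches not named there (other than 470 and later) are not accepted.
    """
    parsed = parse_driver_version(version)
    branch = parsed[0]
    if branch >= DEFAULT_MIN_BRANCH:
        return True
    if data_center_gpu and branch in DATA_CENTER_MINIMUMS:
        # Compare (major, minor) as integers, not as strings or floats.
        return parsed[:2] >= DATA_CENTER_MINIMUMS[branch]
    return False


if __name__ == "__main__":
    print(driver_is_listed("470.57.02"))
    print(driver_is_listed("450.51.06", data_center_gpu=True))
```

The installed driver version can be read with nvidia-smi and passed to the helper as a string. Comparing integer tuples avoids mistakes such as treating 450.9 as newer than 450.51.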

GPU Requirements

Release 21.09 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the NVIDIA Pascal, Volta, Turing, and Ampere Architecture GPU families. For a list of GPUs and their compute capabilities, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
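As a rough illustration of the requirement above, the following Python sketch checks a compute capability string against the 6.0 minimum. The helper names are hypothetical, and the sketch assumes you already know the compute capability of your GPU, for example from the CUDA GPUs page.

```python
# Minimum compute capability stated for release 21.09.
MIN_COMPUTE_CAPABILITY = (6, 0)


def parse_compute_capability(text):
    """Turn a string such as '7.5' into an integer pair: (7, 5)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def is_supported(text):
    """Return True if the compute capability is 6.0 or higher."""
    return parse_compute_capability(text) >= MIN_COMPUTE_CAPABILITY


if __name__ == "__main__":
    for capability in ("5.2", "6.0", "7.5", "8.6"):
        print(capability, is_supported(capability))
```

The capability is compared as a (major, minor) pair of integers instead of a floating-point number, so that the two parts are never confused by decimal rounding or ordering.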

Key Features and Enhancements

This Inference Server release includes the following key features and enhancements.

  • Full-featured, beta version of Business Logic Scripting released.
  • Beta version of a basic Java client released. See https://github.com/triton-inference-server/client/tree/r21.09/src/java for a list of supported features.
  • A stack trace is now printed when Triton crashes to aid in debugging.
  • The Triton Client SDK wheel file is now available directly from PyPI for both Ubuntu and Windows.
  • The TensorRT backend is now an optional part of Triton just like all the other backends. The compose utility can be used to create a Triton container that does not contain the TensorRT backend.
  • Model Analyzer can profile with perf_analyzer's C-API.
  • Model Analyzer can use the CUDA Device Index in addition to the GPU UUID in the --gpus flag.
  • Refer to the 21.09 column of the Frameworks Support Matrix for container image versions that the 21.09 inference server container is based on.

Known Issues

  • Triton’s TensorRT support depends on the input-consumed feature of TensorRT. In some rare cases using TensorRT 8.0 and earlier versions, the input-consumed event fires earlier than expected, causing Triton to overwrite input tensors while they are still in use and leading to corrupt input data being used for inference. This situation occurs when the inputs feed directly into a TensorRT layer that is optimized into a ForeignNode in the builder log. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model’s configuration (https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto#L816).
  • Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
  • Triton metrics may not work if the host machine is running a separate DCGM agent, either on bare-metal or in a container.
  • There is a known issue in TensorRT 8.0 regarding accuracy for a certain case of int8 inferencing on A40 and similar GPUs. The version of TF-TRT in TF2 21.09 includes a feature that works around this issue, but TF1 21.09 does not include that feature and therefore Triton users may experience the accuracy drop for a small subset of model/data type/batch size combinations on A40 when TF-TRT is used through the TF1 backend. This will be fixed in the next version of TensorRT.
  • Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured, can lead to a slowdown in model execution due to the following PyTorch issue: https://github.com/pytorch/pytorch/issues/27902.
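
For the TensorRT input-consumed issue listed above, the workaround is to enable output_copy_stream in the model configuration. The following Python sketch shows one way to append that setting to the text of a config.pbtxt file. It is an assumption-laden example: the helper name is hypothetical, and the nesting of the field under optimization and cuda reflects a reading of model_config.proto that should be verified against the definition linked in the known issue.

```python
# Assumed nesting of the option (optimization -> cuda -> output_copy_stream);
# verify against the model_config.proto definition linked above.
WORKAROUND_BLOCK = """optimization {
  cuda {
    output_copy_stream: true
  }
}
"""


def add_output_copy_stream(config_text):
    """Return config.pbtxt text with the output_copy_stream setting appended.

    Hypothetical helper. It uses plain substring checks and does not parse
    protobuf text, so a configuration that already mentions "optimization"
    must be merged by hand.
    """
    if "output_copy_stream" in config_text:
        return config_text  # Setting already present; leave the text unchanged.
    if "optimization" in config_text:
        raise ValueError(
            "configuration already has an optimization block; "
            "merge output_copy_stream into it by hand"
        )
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + WORKAROUND_BLOCK


if __name__ == "__main__":
    example = 'name: "my_trt_model"\nplatform: "tensorrt_plan"\n'
    print(add_output_copy_stream(example))
```

The model name and platform in the usage example are placeholders. Editing config.pbtxt by hand achieves the same result, and the change applies only to models served by the TensorRT backend.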