Triton Inference Server Release 21.12

The Triton Inference Server container image, release 21.12, is available on NGC and is open source on GitHub.

Contents of the Triton Inference Server container

The Triton Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tritonserver.

The container also includes the following:

Driver Requirements

Release 21.12 is based on NVIDIA CUDA 11.5.0, which requires NVIDIA driver release 495 or later. However, if you are running on a Data Center GPU (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), 460.27 (or later R460), or 470.57 (or later R470). The CUDA driver's compatibility package supports only particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.
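
The driver rule above can be sketched as a small check. This is a minimal illustration, not an official tool: the function names are hypothetical, and it assumes a strict reading in which a Data Center GPU on a driver branch not listed above (for example, R465) is treated as unsupported.

```python
# Minimum driver versions copied from the paragraph above, keyed by driver branch.
DATA_CENTER_MINIMUMS = {
    418: (418, 40),
    440: (440, 33),
    450: (450, 51),
    460: (460, 27),
    470: (470, 57),
}

# CUDA 11.5.0 itself requires driver release 495 or later.
DEFAULT_MIN_BRANCH = 495


def parse_driver_version(version: str) -> tuple:
    """Turn a version string such as '470.57.02' into (470, 57, 2)."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_is_supported(version: str, data_center_gpu: bool = False) -> bool:
    """Hypothetical helper: apply the 21.12 driver rule to one driver version.

    Assumption: branches that are not listed in DATA_CENTER_MINIMUMS and are
    below 495 are rejected, even on a Data Center GPU.
    """
    parsed = parse_driver_version(version)
    branch = parsed[0]
    if branch >= DEFAULT_MIN_BRANCH:
        return True
    if not data_center_gpu:
        return False
    minimum = DATA_CENTER_MINIMUMS.get(branch)
    # Compare only (branch, minor) against the documented minimum.
    return minimum is not None and parsed[:2] >= minimum
```

The version string can be taken from the output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The CUDA Application Compatibility topic remains the authoritative list of supported drivers.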

GPU Requirements

Release 21.12 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the NVIDIA Pascal, Volta, Turing, and Ampere Architecture GPU families. For a list of GPUs and their compute capabilities, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
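
As a minimal sketch of the compute capability requirement, the comparison can be written as follows. The helper names are hypothetical, and the compute capability string is assumed to come from the CUDA GPUs list or from a device query.

```python
# Minimum CUDA compute capability stated for release 21.12.
MIN_COMPUTE_CAPABILITY = (6, 0)


def parse_compute_capability(text: str) -> tuple:
    """Turn '7.5' into (7, 5); a bare major such as '6' becomes (6, 0)."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def gpu_is_supported(compute_capability: str) -> bool:
    """Hypothetical helper: True when the GPU meets the 6.0 minimum."""
    return parse_compute_capability(compute_capability) >= MIN_COMPUTE_CAPABILITY
```

For example, a T4 (compute capability 7.5) passes this check, while a GPU with compute capability 5.2 does not.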

Key Features and Enhancements

This Inference Server release includes the following key features and enhancements.

  • Improved Inferentia support to use Neuron Runtime 2.x and multiple instances.
  • Models from MLflow can now be deployed to Triton with the MLflow plugin.
  • The preview release of TorchTRT models is now supported. PyTorch models optimized using TensorRT can now be loaded into Triton in the same way as other PyTorch models.
  • At the end of each Model Analyzer phase, an example command line for running the next phase is now printed.
  • Refer to the 21.12 column of the Frameworks Support Matrix for container image versions that the 21.12 inference server container is based on.

Known Issues

  • A bug in the GRPC protobuf implementation was resolved by https://github.com/triton-inference-server/common/pull/34. Client code that uses the byte_contents field must be updated to use bytes_contents instead.
  • Triton pip wheels for ARM SBSA are not available from PyPI, and pip will install an incorrect Jetson version of Triton for ARM SBSA. The correct wheel file can be pulled directly from the ARM SBSA SDK image and manually installed.
  • Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. See https://github.com/pytorch/pytorch/issues/66930.
  • Triton’s TensorRT support depends on the input-consumed feature of TensorRT. In some rare cases using TensorRT 8.0 and earlier versions, the input-consumed event fires earlier than expected, causing Triton to overwrite input tensors while they are still in use and leading to corrupt input data being used for inference. This situation occurs when the inputs feed directly into a TensorRT layer that is optimized into a ForeignNode in the builder log. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model’s configuration (https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto#L816).
  • Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
  • Triton metrics may not work if the host machine is running a separate DCGM agent, either on bare-metal or in a container.
  • Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured, can lead to a slowdown in model execution due to the following PyTorch issue: https://github.com/pytorch/pytorch/issues/27902.
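
For the GRPC field rename described in the first known issue, one way to locate affected client code is a word-boundary search for the old field name. The sketch below is illustrative only: the helper names are hypothetical, and it operates on source text passed in as a string rather than on any Triton API.

```python
import re

# Matches the old field name only. "bytes_contents" does not match because
# the "s" after "byte" breaks the literal pattern.
OLD_FIELD = re.compile(r"\bbyte_contents\b")


def find_old_field_uses(source: str) -> list:
    """Return (line_number, stripped_line) pairs that still use byte_contents."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if OLD_FIELD.search(line):
            hits.append((number, line.strip()))
    return hits


def rename_old_field(source: str) -> str:
    """Return the source text with byte_contents replaced by bytes_contents."""
    return OLD_FIELD.sub("bytes_contents", source)
```

Reviewing the reported lines before applying the rename is advisable, since the same identifier may appear in comments or in unrelated code.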