Triton Inference Server Release 21.04

The Triton Inference Server container image, release 21.04, is available on NGC and is open source on GitHub.

Contents of the Triton Inference Server container

The Triton Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tritonserver.

The container also includes the following:

Driver Requirements

Release 21.04 is based on NVIDIA CUDA 11.3.0, which requires NVIDIA Driver release 465.19.01 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), or 460.27 (or later R460). The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.
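
The driver rules above can be expressed as a small check. The following is a minimal sketch rather than an NVIDIA-provided tool: the function and constant names are hypothetical, and the version thresholds are copied from the paragraph above. It treats "or later R418" as "a later driver within the same R418 branch".

```python
# Minimum driver for CUDA 11.3.0, as stated in the release notes.
DEFAULT_MINIMUM = (465, 19, 1)

# Data Center GPU branches: driver major version -> minimum driver in that branch.
# These values come from the release notes; the table itself is illustrative.
DATA_CENTER_BRANCH_MINIMUMS = {
    418: (418, 40),
    440: (440, 33),
    450: (450, 51),
    460: (460, 27),
}


def parse_driver_version(version):
    """Turn a string such as '460.32.03' into a tuple of ints: (460, 32, 3)."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_is_supported(version, data_center_gpu=False):
    """Return True if the driver version satisfies the 21.04 requirements."""
    parsed = parse_driver_version(version)
    if parsed >= DEFAULT_MINIMUM:
        return True
    if data_center_gpu:
        # Older drivers qualify only inside one of the listed branches.
        minimum = DATA_CENTER_BRANCH_MINIMUMS.get(parsed[0])
        return minimum is not None and parsed >= minimum
    return False
```

The version string could be obtained from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The authoritative list of supported drivers remains the CUDA Application Compatibility topic.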

GPU Requirements

Release 21.04 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the NVIDIA Pascal, Volta, Turing, and Ampere Architecture GPU families. For a list of GPUs that this compute capability corresponds to, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.

Key Features and Enhancements

This Inference Server release includes the following key features and enhancements.

  • Python backend performance has been increased significantly.
  • ONNX Runtime is updated to version 1.7.1.
  • Triton Server is now available as a GKE Marketplace Application. See https://github.com/triton-inference-server/server/tree/master/deploy/gke-marketplace-app.
  • The GRPC client libraries now allow compression to be enabled.
  • Ragged batching is now supported for TensorFlow models.
  • For TensorFlow models represented with SavedModel format, it is now possible to choose which graph and signature_def to load. See https://github.com/triton-inference-server/tensorflow_backend/tree/r21.04#parameters.
  • A Helm Chart example is added for AWS. See https://github.com/triton-inference-server/server/tree/master/deploy/aws.
  • The Model Control API is enhanced to provide an option when unloading an ensemble model. The option allows all contained models to be unloaded as part of unloading the ensemble. See https://github.com/triton-inference-server/server/blob/master/docs/protocol/extension_model_repository.md#model-repository-extension.
  • Model reloading using the Model Control API previously resulted in the model being unavailable for a short period of time. This is now fixed so that the model remains available during reloading.
  • Latency statistics and metrics for TensorRT models are fixed. Previously the sum of the "compute input", "compute infer" and "compute output" times accurately indicated the entire compute time but the total time could be incorrectly attributed across the three components. This incorrect attribution is now fixed and all values are now accurate.
  • Error reporting is improved for the Azure, S3 and GCS cloud file system support.
  • Fixed trace support for ensembles. The models contained within an ensemble are now traced correctly.
  • Model Analyzer improvements:
    • Summary report now includes GPU Power usage.
    • Model Analyzer now finds the top N model configurations across multiple models.
  • Refer to the 21.04 column of the Frameworks Support Matrix for container image versions that the 21.04 inference server container is based on.
  • Ubuntu 20.04 with March 2021 updates.
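
As an illustration of the ensemble unload option in the Model Control API item above, the sketch below builds (but does not send) an HTTP unload request. It assumes the endpoint path and the `unload_dependents` parameter name as described in the linked model repository extension document, so verify both there. The server address `http://localhost:8000`, the model name `my_ensemble`, and the helper name are hypothetical.

```python
import json
import urllib.request


def build_unload_request(base_url, model_name, unload_dependents=False):
    """Build a POST request asking Triton to unload a model.

    Assumption: the endpoint and the "unload_dependents" parameter follow
    the model repository extension linked in the release notes.
    """
    url = f"{base_url.rstrip('/')}/v2/repository/models/{model_name}/unload"
    body = {"parameters": {"unload_dependents": unload_dependents}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address and ensemble name.
request = build_unload_request(
    "http://localhost:8000", "my_ensemble", unload_dependents=True
)
# To send it against a running server: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the example runnable without a live server.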

NVIDIA Triton Inference Server Container Versions

The following table shows what versions of Ubuntu, CUDA, Triton Inference Server, and TensorRT are supported in each of the NVIDIA containers for Triton Inference Server. For older container versions, refer to the Frameworks Support Matrix.

Known Issues

  • Triton’s TensorRT support depends on the input-consumed feature of TensorRT. In some rare cases using TensorRT 8.0 and earlier versions, the input-consumed event fires earlier than expected, causing Triton to overwrite input tensors while they are still in use and leading to corrupt input data being used for inference. This situation occurs when the inputs feed directly into a TensorRT layer that is optimized into a ForeignNode in the builder log. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model’s configuration (https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto#L816).
  • Compared with the 21.02 and earlier releases, there are backwards incompatible changes in the example Python client shared-memory support library when that library is used for tensors of type BYTES. The utils.serialize_byte_tensor() and utils.deserialize_byte_tensor() functions now return np.object_ numpy arrays where previously they returned np.bytes_ numpy arrays. Code depending on np.bytes_ must be updated. This change was necessary because the np.bytes_ type removes all trailing zeros from each array element and so binary sequences ending in zero(s) could not be represented with the old behavior. Correct usage of the Python client shared-memory support library is shown in https://github.com/triton-inference-server/server/blob/r21.03/src/clients/python/examples/simple_http_shm_string_client.py.
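
The trailing-zero behavior behind the BYTES change above can be reproduced with numpy alone. This is a minimal sketch that does not use the Triton client library; the sample values are made up for illustration.

```python
import numpy as np

# Two binary payloads; the first ends in two zero bytes.
raw = [b"ab\x00\x00", b"\x01\x02"]

as_bytes = np.array(raw, dtype=np.bytes_)    # fixed-width byte strings
as_object = np.array(raw, dtype=np.object_)  # arbitrary Python objects

# np.bytes_ drops the trailing zero bytes when an element is read back.
print(len(as_bytes[0]))   # 2
# np.object_ keeps the original bytes object unchanged.
print(len(as_object[0]))  # 4
```

Code that previously relied on np.bytes_ arrays from the shared-memory utilities should expect np.object_ arrays whose elements are plain bytes objects.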