Triton Inference Server Release 20.03.1

The Triton Inference Server container image, release 20.03.1, is available on NGC and is open source on GitHub.

Contents of the Triton Inference Server container

The Triton Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tritonserver.

Driver Requirements

Release 20.03.1 is based on NVIDIA CUDA 10.2.89, which requires NVIDIA driver release 440.33.01 or later. However, if you are running on a Tesla GPU (for example, a T4 or any other Tesla board), you may use NVIDIA driver release 396, 384.111+, 410, 418.xx, or 440.30. The CUDA driver's compatibility package supports only particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
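The driver rule above can be sketched as a small check. This is only an illustration under stated assumptions: the helper names (parse_version, driver_ok) are hypothetical, "384.111+" is read as "384.111 or newer within the 384 branch", and "440.30" is read as "440.30 or newer". The CUDA Application Compatibility topic remains the authoritative list.

```python
def parse_version(text):
    """Turn a dotted driver version such as '440.33.01' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Driver release that CUDA 10.2.89 requires, per the release notes.
MIN_DRIVER = (440, 33, 1)

# Older driver branches the notes list as usable on Tesla boards.
TESLA_COMPAT_BRANCHES = {396, 410, 418}


def driver_ok(version_text, is_tesla=False):
    """Return True if the driver satisfies the stated requirement (hypothetical helper)."""
    version = parse_version(version_text)
    if version >= MIN_DRIVER:
        return True
    if not is_tesla:
        return False
    branch = version[0]
    if branch in TESLA_COMPAT_BRANCHES:
        return True
    if branch == 384:
        # Assumption: "384.111+" means 384.111 or newer in the 384 branch.
        return version >= (384, 111)
    if branch == 440:
        # Assumption: "440.30" means 440.30 or newer.
        return version >= (440, 30)
    return False


if __name__ == "__main__":
    for text, tesla in [("440.33.01", False), ("418.87.00", False), ("418.87.00", True)]:
        print(text, "tesla" if tesla else "non-tesla", driver_ok(text, tesla))
```

The check compares versions as integer tuples rather than strings, so that "440.33.01" and "440.9" order numerically instead of lexically.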

GPU Requirements

Release 20.03.1 supports CUDA compute capability 6.0 and higher, which corresponds to GPUs in the NVIDIA Pascal, Volta, and Turing families. For a list of GPUs with this compute capability, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
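As a rough illustration of the compute capability requirement, the comparison can be written as a tuple check. The function name gpu_supported is hypothetical, and the (major, minor) pair is assumed to come from whatever tool you already use to query the GPU.

```python
# Minimum CUDA compute capability supported by release 20.03.1.
MIN_COMPUTE_CAPABILITY = (6, 0)


def gpu_supported(major, minor):
    """Return True if compute capability major.minor is 6.0 or higher (hypothetical helper)."""
    return (major, minor) >= MIN_COMPUTE_CAPABILITY


if __name__ == "__main__":
    # Example capabilities: Pascal P100 is 6.0, Volta V100 is 7.0,
    # Turing T4 is 7.5, and Maxwell-generation parts such as 5.2 fall below the minimum.
    for capability in [(6, 0), (7, 0), (7, 5), (5, 2)]:
        print(capability, gpu_supported(*capability))
```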

Key Features and Enhancements

This Inference Server release includes the following key features and enhancements.
  • Updates for KFServing HTTP/REST and GRPC protocols and corresponding Python and C++ client libraries. See the Roadmap section of the README for more information.
  • Updated GRPC version to 1.24.0.
  • Several issues with S3 storage were resolved.
  • Fixed the last_inference_timestamp value to correctly show the time when inference last occurred for each model.
  • The Caffe2 backend is deprecated. Support for Caffe2 models will be removed in a future release.
  • Refer to the 20.03 column of the Frameworks Support Matrix for container image versions that the 20.03.1 inference server container is based on.
  • The inference server container image version 20.03.1 is additionally based on ONNX Runtime 1.2.0.
  • Ubuntu 18.04 with April 2020 updates.

NVIDIA Triton Inference Server Container Versions

The following table shows what versions of Ubuntu, CUDA, Triton Inference Server, and TensorRT are supported in each of the NVIDIA containers for Triton Inference Server. For older container versions, refer to the Frameworks Support Matrix.

Known Issues

  • The KFServing HTTP/REST and GRPC protocols and corresponding V2 experimental Python and C++ clients are beta quality and are likely to change. Specifically:
    • The data returned by the statistics API will change to include additional information.
    • The data returned by the repository index API will change to include additional information.
  • The new C API specified in tritonserver.h is beta quality and is likely to change.
  • When using the experimental V2 HTTP/REST C++ client, classification results are not supported for output tensors.
  • At high concurrency values, the experimental V2 perf_client_v2 may not achieve throughput as high as the V1 perf_client.
  • TensorRT reformat-free I/O is not supported.
  • Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
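The GKE guidance in the last item can be sketched as a version-range check. This is a conservative illustration, not a statement of exactly which GKE builds carry the regression: it flags every version after the 1.13 line and before 1.14.6, treats any 1.13.x as "1.13 or earlier", and assumes version strings of the form "1.14.3-gke.11". The function name gke_version_in_avoid_range is hypothetical.

```python
def gke_version_in_avoid_range(version_text):
    """Return True if the GKE version lies between the 1.13 line and 1.14.6 (hypothetical helper)."""
    # Drop a build suffix such as "-gke.11" (assumed format).
    core = version_text.split("-", 1)[0]
    parts = tuple(int(piece) for piece in core.split("."))
    # Assumption: every 1.13.x release counts as "1.13 or earlier".
    if parts[:2] <= (1, 13):
        return False
    # Anything newer than the 1.13 line but older than 1.14.6 is flagged.
    return parts < (1, 14, 6)


if __name__ == "__main__":
    for text in ["1.13.11-gke.14", "1.14.3-gke.11", "1.14.6-gke.1", "1.15.0"]:
        print(text, gke_version_in_avoid_range(text))
```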