Triton Inference Server Release 19.09

The TensorRT Inference Server container image, release 19.09, is available on NGC and is open source on GitHub.

Contents of the Triton inference server container

The TensorRT Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tensorrtserver.

Driver Requirements

Release 19.09 is based on NVIDIA CUDA 10.1.243, which requires NVIDIA Driver release 418.xx. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 396, 384.111+ or 410. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

GPU Requirements

Release 19.09 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the Pascal, Volta, and Turing families. For a list of GPUs to which this compute capability corresponds, see CUDA GPUs. For additional support details, see the Deep Learning Frameworks Support Matrix.

Key Features and Enhancements

This Inference Server release includes the following key features and enhancements.
  • The inference server container image version 19.09 is based on NVIDIA TensorRT Inference Server 1.6.0, TensorFlow 1.14.0, ONNX Runtime 0.4.0, and PyTorch 1.2.0.
  • Latest version of NVIDIA cuDNN 7.6.3
  • Latest version of TensorRT 6.0.1
  • Added TensorRT 6 support, which includes support for TensorRT dynamic shapes
  • Shared memory support is added as an alpha feature in this release. This support allows input and output tensors to be communicated via shared memory instead of over the network. Currently only system (CPU) shared memory is supported.
  • Amazon S3 is now supported as a remote file system for model repositories. Use the s3:// prefix on model repository paths to reference S3 locations.
  • The inference server library API is available as a beta in this release. The library API allows you to link against libtrtserver.so so that you can include all of the inference server functionality directly in your application.
  • gRPC endpoint performance improvement. The inference server’s gRPC endpoint now uses significantly less memory while delivering higher performance.
  • The ensemble scheduler is now more flexible in allowing batching and non-batching models to be composed together in an ensemble.
  • The ensemble scheduler will now keep tensors in GPU memory between models when possible. Doing so significantly increases performance of some ensembles by avoiding copies to and from system memory.
  • The performance client, perf_client, now supports models with variable-sized input tensors.
  • Ubuntu 18.04 with August 2019 updates
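For the S3 model-repository support above, pointing the server at a bucket is a matter of using the s3:// prefix when starting it. A minimal invocation sketch, with a hypothetical bucket name (credentials are expected from the environment in the usual AWS manner):

```shell
# Hypothetical bucket/path; the s3:// prefix tells the server to treat the
# model repository as a remote S3 location.
trtserver --model-repository=s3://my-bucket/model_repository
```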
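The ensemble scheduler changes above compose multiple models into a pipeline described in a model configuration. The following config.pbtxt fragment is a sketch of a two-step ensemble in which a preprocessing model feeds a classifier; the model names, tensor names, and shapes are hypothetical:

```
name: "preprocess_and_infer"
platform: "ensemble"
max_batch_size: 8
input [ { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] } ]
output [ { name: "SCORES", data_type: TYPE_FP32, dims: [ 1000 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "image_preprocess"
      model_version: -1
      input_map { key: "INPUT" value: "RAW_IMAGE" }
      output_map { key: "OUTPUT" value: "preprocessed" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "input" value: "preprocessed" }
      output_map { key: "prob" value: "SCORES" }
    }
  ]
}
```

The intermediate tensor ("preprocessed" here) is what the scheduler can now keep in GPU memory between steps, avoiding the round trip through system memory.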
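The alpha shared-memory feature above works by having client and server exchange tensor bytes through a named system (CPU) shared-memory region rather than serializing them over the network. The sketch below illustrates only that general mechanism using Python's standard library; it is not the inference server client API, and the region name and helper functions are hypothetical.

```python
# Illustrative sketch of passing tensor data through system (CPU) shared
# memory. The real inference server client API differs; names here are
# made up for the example.
import struct
from multiprocessing import shared_memory

def write_tensor(region_name, values):
    """Producer side: pack float32 values into a named shared-memory region."""
    data = struct.pack(f"{len(values)}f", *values)
    shm = shared_memory.SharedMemory(name=region_name, create=True,
                                     size=len(data))
    shm.buf[:len(data)] = data  # one copy into shared memory
    return shm

def read_tensor(region_name, count):
    """Consumer side: map the same region and unpack the floats, no network."""
    shm = shared_memory.SharedMemory(name=region_name)
    values = list(struct.unpack(f"{count}f", bytes(shm.buf[:count * 4])))
    shm.close()
    return values

# Round-trip a small "tensor" through the shared region.
shm = write_tensor("demo_input_region", [1.0, 2.0, 3.0])
result = read_tensor("demo_input_region", 3)
shm.close()
shm.unlink()  # release the region when both sides are done
```

Because both sides map the same physical memory, the tensor bytes are copied at most once, which is the saving the feature targets for large inputs and outputs.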

Known Issues

  • The ONNX Runtime backend could not be updated to the 0.5.0 release due to multiple performance and correctness issues with that release.
  • TensorRT 6:
    • Reformat-free I/O is not supported.
    • Only models that have a single optimization profile are currently supported.
  • Google Kubernetes Engine (GKE) version 1.14 contains a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version to avoid this issue.