Triton Inference Server Release 19.10

The TensorRT Inference Server container image, release 19.10, is available on NGC and is open source on GitHub.

Contents of the Triton Inference Server Container

The TensorRT Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tensorrtserver.

Driver Requirements

Release 19.10 is based on NVIDIA CUDA 10.1.243, which requires NVIDIA driver release 418.xx. However, if you are running on a Tesla GPU (for example, a T4 or any other Tesla board), you may use NVIDIA driver release 384.111+, 396, or 410. The CUDA driver's compatibility package supports only particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
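
The driver rule above can be expressed as a small check. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that "418.xx" means driver release 418 or later. The CUDA Application Compatibility topic remains the authoritative list.

```python
def parse_driver(version: str) -> tuple:
    """Turn a driver string such as '418.87' into a tuple of ints: (418, 87)."""
    return tuple(int(part) for part in version.split("."))


def driver_ok(version: str, tesla: bool = False) -> bool:
    """Hypothetical helper: does this driver satisfy the 19.10 requirement?

    Assumption: '418.xx' is read as 'release 418 or later'.
    On Tesla boards, releases 384.111+, 396, and 410 are also accepted.
    """
    parsed = parse_driver(version)
    major = parsed[0]
    if major >= 418:
        return True
    if not tesla:
        return False
    if major in (396, 410):
        return True
    if major == 384:
        # Only 384.111 and later point releases qualify.
        return parsed >= (384, 111)
    return False


if __name__ == "__main__":
    for version, tesla in [("418.87", False), ("410.104", False),
                           ("410.104", True), ("384.81", True)]:
        print(version, "tesla" if tesla else "non-tesla", driver_ok(version, tesla))
```

The driver string itself can be read from the output of nvidia-smi on the host; how it is obtained is left out of the sketch.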

GPU Requirements

Release 19.10 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the Pascal, Volta, and Turing families. For a list of GPUs that have this compute capability, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.

Key Features and Enhancements

This Inference Server release includes the following key features and enhancements.
  • The inference server container image version 19.10 is based on NVIDIA TensorRT Inference Server 1.7.0, TensorFlow 1.14.0, ONNX Runtime 0.4.0, and PyTorch 1.3.0.
  • A Client SDK container is now provided on NGC in addition to the inference server container. The client SDK container includes the client libraries and examples.
  • Latest version of NVIDIA cuDNN 7.6.4
  • TensorRT optimization may now be enabled for any TensorFlow model by enabling the feature in the optimization section of the model configuration.
  • The ONNX Runtime backend now includes the TensorRT and OpenVINO execution providers. These providers are enabled in the optimization section of the model configuration.
  • Automatic configuration generation (--strict-model-config=false) now works correctly for TensorRT models with variable-sized inputs and/or outputs.
  • Multiple model repositories may now be specified on the command line. Optional command-line options can be used to explicitly load specific models from each repository.
  • Ensemble models are now pruned dynamically so that only models needed to calculate the requested outputs are executed.
  • The example clients now include a simple Go example that uses the gRPC API.
  • Ubuntu 18.04 with September 2019 updates
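
Two of the items above refer to the optimization section of the model configuration. As a rough illustration, the Python sketch below renders a fragment that could be appended to a model's config.pbtxt. It assumes the execution_accelerators layout and the accelerator names "tensorrt" and "openvino"; neither is spelled out in these release notes, so verify both against the model configuration documentation for this release. The helper name is hypothetical.

```python
def accelerator_fragment(name: str, device: str = "gpu") -> str:
    """Render an optimization section enabling one execution accelerator.

    Assumption: the configuration uses an 'execution_accelerators' block with
    'gpu_execution_accelerator' / 'cpu_execution_accelerator' lists.
    """
    if device not in ("gpu", "cpu"):
        raise ValueError("device must be 'gpu' or 'cpu'")
    return (
        "optimization { execution_accelerators {\n"
        f'  {device}_execution_accelerator : [ {{ name : "{name}" }} ]\n'
        "}}\n"
    )


if __name__ == "__main__":
    # TensorRT optimization for a TensorFlow (or ONNX Runtime) model on GPU.
    print(accelerator_fragment("tensorrt", device="gpu"))
    # OpenVINO execution provider for an ONNX Runtime model on CPU (assumed name).
    print(accelerator_fragment("openvino", device="cpu"))
```

Generating the text from a helper is only a convenience for the example; in practice the fragment is normally written by hand in the model's configuration file.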

Known Issues

  • In TensorRT 6.0.1, reformat-free I/O is not supported.
  • Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). To avoid this issue, use GKE 1.13 or earlier, or GKE 1.14.6 or later.
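
The GKE guidance above can be turned into a quick pre-deployment check. The Python sketch below is a conservative assumption rather than an official test: it flags every version from 1.14.0 up to, but not including, 1.14.6 as potentially affected, although only some of those versions may contain the regression. The function name and the sample version strings are hypothetical.

```python
def gke_possibly_affected(version: str) -> bool:
    """Return True if a GKE version falls in the range the note says to avoid.

    Accepts strings such as '1.14.3-gke.11'; the '-gke.N' suffix is ignored.
    Assumption: anything in [1.14.0, 1.14.6) is treated as possibly affected.
    """
    core = version.split("-", 1)[0]
    parts = tuple(int(p) for p in core.split("."))
    # Pad short versions such as '1.14' to three components.
    parts = (parts + (0, 0, 0))[:3]
    return (1, 14, 0) <= parts < (1, 14, 6)


if __name__ == "__main__":
    for v in ["1.13.7-gke.8", "1.14.3-gke.11", "1.14.6-gke.1"]:
        print(v, gke_possibly_affected(v))
```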