Triton Inference Server Release 19.12

The TensorRT Inference Server container image, release 19.12, is available on NGC and is open source on GitHub.

Contents of the Triton inference server container

The TensorRT Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tensorrtserver.

Driver Requirements

Release 19.12 is based on NVIDIA CUDA 10.2.89, which requires NVIDIA Driver release 440.30. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 396, 384.111+, 410, 418.xx, or 440.30. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

GPU Requirements

Release 19.12 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the Pascal, Volta, and Turing families. For a list of GPUs to which this compute capability corresponds, see CUDA GPUs. For additional support details, see the Deep Learning Frameworks Support Matrix.

Key Features and Enhancements

This Inference Server release includes the following key features and enhancements.
  • Refer to the Frameworks Support Matrix for container image versions that the 19.12 inference server container is based on.
  • The inference server container image version 19.12 is additionally based on ONNX Runtime 1.1.1.
  • The model configuration now includes a model warmup option. This option provides the ability to tune and optimize the model before inference requests are received, avoiding initial inference delays. It is especially useful for frameworks like TensorFlow that perform network optimization in response to the initial inference requests. Models can be warmed up with one or more synthetic or realistic workloads before they become ready in the server; see the configuration sketch after this list.
  • An enhanced sequence batcher now has multiple scheduling strategies. A new Oldest strategy integrates with the dynamic batcher to enable improved inference performance for models that don't require all inference requests in a sequence to be routed to the same batch slot (see the configuration sketch after this list).
  • The perf_client now has an option to generate requests using a realistic Poisson distribution or a user-provided distribution.
  • A new repository API (available in the shared library API, HTTP, and gRPC) returns an index of all models available in the model repositories visible to the server. This index can be used to see which models are available for loading onto the server; a minimal HTTP query example follows this list.
  • The server status returned by the server status API now includes the timestamp of the last inference request received for each model.
  • Inference server tracing capabilities are now documented in the Optimization section of the User Guide. Tracing support is enhanced to provide traces for ensembles and their contained models.
  • A community contributed Dockerfile is now available to build the TensorRT Inference Server clients on CentOS.
  • Ubuntu 18.04 with November 2019 updates
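
The warmup and Oldest-strategy items above are both controlled from a model's config.pbtxt. The following is a minimal sketch, not taken from this release's documentation: the model name, tensor names, and sizes are invented for illustration, the field names follow the model configuration schema (model_config.proto) and should be verified against the copy shipped with this release, and required details such as sequence control inputs are omitted.

    # Illustrative config.pbtxt fragments (verify field names against
    # model_config.proto in this release).
    name: "example_model"              # hypothetical model name
    platform: "tensorflow_graphdef"
    max_batch_size: 8
    input  [ { name: "INPUT0"  data_type: TYPE_FP32 dims: [ 16 ] } ]
    output [ { name: "OUTPUT0" data_type: TYPE_FP32 dims: [ 16 ] } ]

    # Model warmup: run one synthetic, zero-filled request before the model
    # is marked ready so framework-level optimization happens up front.
    model_warmup [
      {
        name: "zero_data_warmup"
        batch_size: 1
        inputs {
          key: "INPUT0"
          value: { data_type: TYPE_FP32 dims: [ 16 ] zero_data: true }
        }
      }
    ]

    # Sequence batcher using the Oldest scheduling strategy, which works with
    # the dynamic batcher so requests from different sequences can share a
    # batch (sequence control inputs omitted for brevity).
    sequence_batching {
      max_sequence_idle_microseconds: 5000000
      oldest {
        max_candidate_sequences: 4
        preferred_batch_size: [ 4 ]
      }
    }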
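
As a rough illustration of the new repository index API over HTTP, the snippet below queries the index from a locally running server. The endpoint path /api/modelrepository/index and the default HTTP port 8000 are assumptions based on the version-1 HTTP protocol and are not confirmed by these notes; check the protocol documentation for this release before relying on them.

    # Minimal sketch: fetch the model repository index over HTTP.
    # Assumed (verify for your deployment): server reachable on
    # localhost:8000 and v1 endpoint path /api/modelrepository/index.
    import urllib.request

    URL = "http://localhost:8000/api/modelrepository/index"

    with urllib.request.urlopen(URL) as response:
        # The body lists the models found in the model repositories visible
        # to the server; see the protocol documentation for the exact
        # response format.
        print(response.status)
        print(response.read().decode("utf-8"))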

Known Issues

  • The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
    • The signature of the CustomGetNextInputV2Fn_t function adds the memory_type_id argument.

    • The signature of the CustomGetOutputV2Fn_t function adds the memory_type_id argument.

  • The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
    • The signature and operation of the TRTSERVER_ResponseAllocatorAllocFn_t function have changed. See src/core/trtserver.h for a description of the new behavior.

    • The signature of the TRTSERVER_InferenceRequestProviderSetInputData function adds the memory_type_id argument.

    • The signature of the TRTSERVER_InferenceResponseOutputData function adds the memory_type_id argument.

  • TensorRT reformat-free I/O is not supported.

  • Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use GKE 1.13 or earlier, or GKE 1.14.6 or later, to avoid this issue.