Triton Inference Server Release 19.11

The TensorRT Inference Server container image, release 19.11, is available on NGC and is open source on GitHub.

Contents of the Triton inference server container

The TensorRT Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tensorrtserver.

Driver Requirements

Release 19.11 is based on NVIDIA CUDA 10.2.89, which requires NVIDIA Driver release 440.30. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 396, 384.111+, 410 or 418.xx. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

GPU Requirements

Release 19.11 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the Pascal, Volta, and Turing families. For a list of GPUs to which this compute capability corresponds, see CUDA GPUs. For additional support details, see the Deep Learning Frameworks Support Matrix.

Key Features and Enhancements

This Inference Server release includes the following key features and enhancements.
  • Refer to the Frameworks Support Matrix for container image versions that the 19.11 inference server container is based on.
  • The inference server container image version 19.11 is additionally based on ONNX Runtime 1.1.1.
  • Shared-memory support has been expanded to include CUDA shared memory (see the sketch following this list).
  • Improved the efficiency of pinned memory used for ensemble models.
  • The perf_client application has been improved with easier-to-use command-line arguments (while maintaining compatibility with existing arguments).
  • Support for string tensors added to perf_client.
  • Documentation contains a new Optimization section discussing some common optimization strategies and how to use perf_client to explore these strategies.
  • Latest version of NVIDIA CUDA 10.2.89 including cuBLAS 10.2.2.89
  • Latest version of NVIDIA cuDNN 7.6.5
  • Latest version of NVIDIA NCCL 2.5.6
  • Ubuntu 18.04 with October 2019 updates
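
The following is a minimal sketch of the CUDA shared-memory flow referenced above, using the Python client library: allocate a region on a GPU, register it with the server, then reference the registered region in a request. The tensorrtserver.cuda_shared_memory helper module, the SharedMemoryControlContext class, and the exact call signatures shown here are assumptions modeled on the client examples of this era and may differ from the shipped API; consult the shared-memory examples in the client package for the authoritative usage.

    # Sketch of CUDA shared-memory usage with the Python client library.
    # Module, class, and function names are assumptions; see the client
    # examples shipped with this release for the exact API.
    import numpy as np
    from tensorrtserver.api import (InferContext, ProtocolType,
                                    SharedMemoryControlContext)
    import tensorrtserver.cuda_shared_memory as cudashm  # assumed helper module

    url = "localhost:8000"
    protocol = ProtocolType.from_str("http")

    # Allocate a 64-byte region on GPU 0 and copy the input tensor into it.
    input0 = np.arange(16, dtype=np.int32)
    shm_handle = cudashm.create_shared_memory_region("input0_data", 64, 0)
    cudashm.set_shared_memory_region(shm_handle, [input0])

    # Register the CUDA region with the inference server.
    shm_ctx = SharedMemoryControlContext(url, protocol)
    shm_ctx.cuda_register(shm_handle)

    # Reference the registered region instead of passing the raw tensor.
    ctx = InferContext(url, protocol, "simple", -1)
    result = ctx.run({"INPUT0": shm_handle},
                     {"OUTPUT0": InferContext.ResultFormat.RAW},
                     batch_size=1)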

Deprecated Features

  • The asynchronous inference API has been modified in the C++ and Python client libraries; only the callback-based forms remain (a usage sketch follows this list).
    • In the C++ library:
      • The non-callback version of the AsyncRun function has been removed.

      • The GetReadyAsyncRequest function has been removed.

      • The signature of the GetAsyncRunResults function has been changed to remove the is_ready and wait arguments.

    • In the Python library:
      • The non-callback version of the async_run function has been removed.

      • The get_ready_async_request function has been removed.

      • The signature of the get_async_run_results function has been changed to remove the wait argument.
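
Because only the callback forms remain, a Python client now passes a completion callback to async_run and retrieves results with get_async_run_results. The sketch below illustrates this pattern; the callback signature (context plus request id) and the call signatures are assumptions modeled on the async examples shipped with the client library and may differ in detail.

    # Sketch of the callback-based async API in the Python client library.
    # Callback and call signatures are assumptions modeled on the async
    # client examples; consult those examples for the exact API.
    import queue
    from functools import partial

    import numpy as np
    from tensorrtserver.api import InferContext, ProtocolType

    def completion_callback(completed, infer_ctx, request_id):
        # Invoked by the client library when the request completes.
        completed.put(request_id)

    completed = queue.Queue()
    ctx = InferContext("localhost:8000", ProtocolType.from_str("http"), "simple", -1)

    # Issue the request asynchronously; the non-callback form no longer exists.
    ctx.async_run(partial(completion_callback, completed),
                  {"INPUT0": [np.arange(16, dtype=np.int32)]},
                  {"OUTPUT0": InferContext.ResultFormat.RAW},
                  batch_size=1)

    # Wait for the callback to report completion, then fetch the results.
    # Note that get_async_run_results no longer takes a wait argument.
    request_id = completed.get()
    results = ctx.get_async_run_results(request_id)
    print(results["OUTPUT0"])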

Known Issues

  • The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
    • The signature of the CustomGetNextInputV2Fn_t function adds the memory_type_id argument.

    • The signature of the CustomGetOutputV2Fn_t function adds the memory_type_id argument.

  • The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
    • The signature and operation of the TRTSERVER_ResponseAllocatorAllocFn_t function have changed. See src/core/trtserver.h for a description of the new behavior.

    • The signature of the TRTSERVER_InferenceRequestProviderSetInputData function adds the memory_type_id argument.

    • The signature of the TRTSERVER_InferenceResponseOutputData function adds the memory_type_id argument.

  • TensorRT reformat-free I/O is not supported.

  • Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use GKE version 1.13 or earlier, or version 1.14.6 or later, to avoid this issue.