Triton Inference Server Release 19.07
The TensorRT Inference Server container image (since renamed Triton Inference Server), release 19.07, is available on NGC and is open source on GitHub.
Contents of the Triton Inference Server container
The TensorRT Inference Server Docker image contains the inference server executable and related shared libraries in /opt/tensorrtserver. The container also includes the following:
- Ubuntu 18.04
- NVIDIA CUDA 10.1.168 including cuBLAS 10.2.0.168
- NVIDIA cuDNN 7.6.1
- NVIDIA NCCL 2.4.7 (optimized for NVLink™)
- MLNX_OFED +3.4
- OpenMPI 3.1.3
- TensorRT 5.1.5
Driver Requirements
Release 19.07 is based on NVIDIA CUDA 10.1.168, which requires NVIDIA Driver release 418.67. However, if you are running on Tesla GPUs (Tesla V100, Tesla P4, Tesla P40, or Tesla P100), you may use NVIDIA driver release 384.111+ or 410. The CUDA driver's compatibility package supports only particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
GPU Requirements
Release 19.07 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the Pascal, Volta, and Turing families. For a list of the GPUs that correspond to these compute capabilities, see CUDA GPUs. For additional support details, see the Deep Learning Frameworks Support Matrix.
Key Features and Enhancements
- The inference server container image version 19.07 is based on NVIDIA TensorRT Inference Server 1.4.0, TensorFlow 1.14.0, ONNX Runtime 0.4.0, and Caffe2 0.8.2.
- Added libtorch as a new backend. PyTorch models that are manually decorated or automatically traced to produce TorchScript can now be run directly by the inference server; see the TorchScript sketch after this list.
- Build system converted from Bazel to CMake. The new CMake-based build system is more transparent, portable, and modular.
- To simplify the creation of custom backends, a Custom Backend SDK and improved documentation are now available.
- Improved AsyncRun API in C++ and Python client libraries; see the Python client sketch after this list.
- perf_client can now use user-supplied input data; previously it used random or zero input data.
- perf_client now reports latency at multiple confidence percentiles (p50, p90, p95, p99) as well as at a user-supplied percentile, which is also used to stabilize latency results; see the percentile sketch after this list.
- Improvements to automatic model configuration creation (--strict-model-config=false).
- C++ and Python client libraries now allow additional HTTP headers to be specified when using the HTTP protocol.
- Latest version of NVIDIA cuDNN 7.6.1
- Latest version of MLNX_OFED +3.4
- Latest version of Ubuntu 18.04
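The libtorch backend consumes serialized TorchScript. Below is a minimal sketch of tracing a PyTorch model and placing the result in a model repository; the model name my_pytorch_model, the repository path, and the ResNet-18/input-shape choice are illustrative assumptions, not part of this release note.

```python
import os
import torch
import torchvision.models as models

# Start from an ordinary eager-mode PyTorch model.
model = models.resnet18(pretrained=True)
model.eval()

# Trace with a representative input to produce a TorchScript module.
example_input = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)

# The server loads libtorch models from a versioned model repository;
# the <model-name>/<version>/model.pt layout below follows the
# server's documented repository structure.
repo_dir = os.path.join("model_repository", "my_pytorch_model", "1")
os.makedirs(repo_dir, exist_ok=True)
traced.save(os.path.join(repo_dir, "model.pt"))
```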
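A sketch of the asynchronous inference path in the Python client library. It assumes the 19.07-era tensorrtserver.api client, where async_run returns a request id and get_async_run_results retrieves the completed result; the server URL, model name, and the tensor names INPUT0/OUTPUT0 are placeholders.

```python
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

# Connect to a running server over HTTP; model and tensor names are
# hypothetical placeholders.
ctx = InferContext("localhost:8000", ProtocolType.HTTP, "my_model", -1)

input_data = np.random.rand(16).astype(np.float32)

# async_run returns immediately with a request id (assumed 19.07-era
# behavior); inference proceeds in the background.
request_id = ctx.async_run(
    {"INPUT0": [input_data]},
    {"OUTPUT0": InferContext.ResultFormat.RAW},
    batch_size=1)

# ... overlap other client-side work here ...

# Block until the result for that request id is available.
result = ctx.get_async_run_results(request_id, True)
print(result["OUTPUT0"][0])
```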
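For context on the percentile reporting, the values perf_client prints are ordinary order statistics over the measured per-request latencies. A small, self-contained illustration (not perf_client code; the latency values are made up):

```python
import numpy as np

# Hypothetical per-request latencies in milliseconds.
latencies_ms = np.array([4.1, 4.3, 4.4, 4.6, 5.0, 5.2, 6.8, 7.5, 9.9, 21.0])

# p50/p90/p95/p99: the latency below which that percentage of
# requests completed. A long tail moves p99 far above p50.
for p in (50, 90, 95, 99):
    print(f"p{p}: {np.percentile(latencies_ms, p):.2f} ms")
```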
Known Issues
There are no known issues in this release.