TensorRT Release 8.x.x

TensorRT Release 8.0.0 Early Access (EA)

These are the TensorRT 8.0.0 Early Access (EA) release notes, which are applicable to Linux x86 users.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 7.x.x release as well as the following additional changes. For previous TensorRT documentation, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
  • Added support for RedHat/CentOS 8.3, Ubuntu 20.04, and SUSE Linux Enterprise Server 15 Linux distributions. Only a tar file installation is supported on SLES 15 at this time. For more information, refer to the TensorRT Installation Guide.
  • Added Python 3.9 support. Use a tar file installation to obtain the new Python wheel files. For more information, refer to the TensorRT Installation Guide.
  • Added three new enumerations to IResizeLayer (ResizeCoordinateTransformation, ResizeSelector, and ResizeRoundMode) and enhanced IResizeLayer to support more resize modes from TensorFlow, PyTorch, and ONNX. For more information, refer to the IResizeLayer section in the TensorRT Developer Guide.
  • Builder timing cache can be serialized and reused across builder instances. For more information, refer to the Builder Layer Timing Cache and trtexec sections in the TensorRT Developer Guide.
  • Added convolution and fully-connected tactics which support and make use of structured sparsity in kernel weights. This feature can be enabled by setting the kSPARSE_WEIGHTS flag in IBuilderConfig. This feature is only available on NVIDIA Ampere GPUs. For more information, refer to the Structured Sparsity section in the Best Practices For TensorRT Performance guide. (not applicable for Jetson platforms)
  • Added two new layers to the API: IQuantizeLayer and IDequantizeLayer which can be used to explicitly specify the precision of operations and data buffers. ONNX’s QuantizeLinear and DequantizeLinear operators are mapped to these new layers which enables the support for networks trained using Quantization-Aware Training (QAT) methodology. For more information, refer to the Explicit-Quantization, IQuantizeLayer, and IDequantizeLayer sections in the TensorRT Developer Guide and Q/DQ Fusion in the Best Practices For TensorRT Performance guide.
  • Optimized QuartzNet with a 1D fused depthwise + pointwise convolution kernel, achieving up to 1.8x end-to-end performance improvement on A100. (not applicable for Jetson platforms)
  • Added support for the following ONNX operators: Celu, CumSum, EyeLike, GatherElements, GlobalLpPool, GreaterOrEqual, LessOrEqual, LpNormalization, LpPool, ReverseSequence, and SoftmaxCrossEntropyLoss. For more information, refer to the Supported Ops section in the TensorRT Support Matrix.
  • Added Sigmoid/Tanh INT8 support for DLA. This allows a DLA subgraph containing Sigmoid/Tanh to compile with INT8 by automatically upgrading to FP16 internally. For more information, refer to the DLA Supported Layers section in the TensorRT Developer Guide.
  • Added DLA native planar format and DLA native gray-scale format support.
  • Added the ability to generate a reformat-free engine with DLA when EngineCapability is EngineCapability::kDEFAULT.
  • TensorRT now declares APIs with the noexcept keyword to clarify that exceptions must not cross the library boundary. All TensorRT classes that an application inherits from (such as IGpuAllocator and IPluginV2) must guarantee that methods called by TensorRT do not throw uncaught exceptions, or the behavior is undefined.
  • TensorRT reports errors, along with an associated ErrorCode, via the ErrorRecorder API for all errors. The ErrorRecorder will fall back to the legacy logger reporting, with Severity::kERROR or Severity::kINTERNAL_ERROR, if no error recorder is registered. The ErrorCodes allow recovery in cases where TensorRT previously reported non-recoverable situations.
  • Improved performance of the GlobalAveragePooling operation, which is used in some CNNs like EfficientNet. For transformer based networks with INT8 precision, it’s recommended to use a network which is trained using Quantization Aware Training (QAT) and has IQuantizeLayer and IDequantizeLayer layers in the network definition.
  • TensorRT now supports refit weights via names. For more information, refer to Refitting An Engine in the TensorRT Developer Guide.
  • Refitting performance has been improved. The improvement is most evident when the weights are large or when a large number of weights or layers are updated at the same time.
  • Added a new sample. This sample, engine_refit_onnx_bidaf, builds an engine from the ONNX BiDAF model, and refits the TensorRT engine with weights from the model. The new refit APIs allow users to locate the weights via names from ONNX models instead of layer names and weights roles. For more information, refer to the Refitting An Engine Built From An ONNX Model In Python section in the TensorRT Sample Support Guide.
  • Improved performance for transformer-based networks such as BERT and other networks that use Multi-Head Self-Attention.
  • Added cuDNN to the TacticSource enum. Use of cuDNN as a source of operator implementations can be enabled or disabled using the IBuilderConfig::setTacticSources API function.
  • The following C++ API functions were added:
    • class IDequantizeLayer
    • class IQuantizeLayer
    • class ITimingCache
    • IBuilder::buildSerializedNetwork()
    • IBuilderConfig::getTimingCache()
    • IBuilderConfig::setTimingCache()
    • IGpuAllocator::reallocate()
    • INetworkDefinition::addDequantize()
    • INetworkDefinition::addQuantize()
    • INetworkDefinition::setWeightsName()
    • IPluginRegistry::deregisterCreator()
    • IRefitter::getMissingWeights()
    • IRefitter::getAllWeights()
    • IRefitter::setNamedWeights()
    • IResizeLayer::getCoordinateTransformation()
    • IResizeLayer::getNearestRounding()
    • IResizeLayer::getSelectorForSinglePixel()
    • IResizeLayer::setCoordinateTransformation()
    • IResizeLayer::setNearestRounding()
    • IResizeLayer::setSelectorForSinglePixel()
    • IScaleLayer::setChannelAxis()
    • enum ResizeCoordinateTransformation
    • enum ResizeMode
    • BuilderFlag::kSPARSE_WEIGHTS
    • TacticSource::kCUDNN
    • TensorFormat::kDLA_HWC4
    • TensorFormat::kDLA_LINEAR
    • TensorFormat::kHWC16
  • The following Python API functions were added:
    • class IDequantizeLayer
    • class IQuantizeLayer
    • class ITimingCache
    • Builder.build_serialized_network()
    • IBuilderConfig.get_timing_cache()
    • IBuilderConfig.set_timing_cache()
    • IGpuAllocator.reallocate()
    • INetworkDefinition.add_dequantize()
    • INetworkDefinition.add_quantize()
    • INetworkDefinition.set_weights_name()
    • IPluginRegistry.deregister_creator()
    • IRefitter.get_missing_weights()
    • IRefitter.get_all_weights()
    • IRefitter.set_named_weights()
    • IResizeLayer.coordinate_transformation
    • IResizeLayer.nearest_rounding
    • IResizeLayer.selector_for_single_pixel
    • IScaleLayer.channel_axis
    • enum ResizeCoordinateTransformation
    • enum ResizeMode
    • BuilderFlag.SPARSE_WEIGHTS
    • TacticSource.CUDNN
    • TensorFormat.DLA_HWC4
    • TensorFormat.DLA_LINEAR
    • TensorFormat.HWC16
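
The IQuantizeLayer and IDequantizeLayer entries above correspond to ONNX's QuantizeLinear and DequantizeLinear operators. As a rough illustration of the arithmetic those operators define for symmetric INT8 quantization (a zero-point of 0), the following pure-Python sketch quantizes and dequantizes a few values. The function names and the example scale are made up for illustration and are not part of the TensorRT API.

```python
def quantize(values, scale):
    """Symmetric INT8 quantization: round(x / scale), saturated to [-128, 127]."""
    # Python's round() rounds halves to the nearest even integer, which is
    # the tie-breaking rule the ONNX QuantizeLinear operator specifies.
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map INT8 codes back to real values: q * scale (zero-point is 0)."""
    return [q * scale for q in q_values]


scale = 0.5  # hypothetical per-tensor scale
activations = [0.2, 1.0, -3.3, 100.0]

codes = quantize(activations, scale)
restored = dequantize(codes, scale)
print(codes)     # [0, 2, -7, 127]
print(restored)  # [0.0, 1.0, -3.5, 63.5]
```

The last value saturates at 127, so it is not recovered by dequantization; this is the kind of error that training with Q/DQ nodes in place lets a network adapt to.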

Breaking API Changes

  • Support for Python 2 has been dropped. This means that TensorRT will no longer include wheels for Python 2, and Python samples will not work with Python 2.
  • All APIs have been marked as noexcept where appropriate. The IErrorRecorder interface has been fully integrated into the API for error reporting. The Logger is only used as a fallback when the ErrorRecorder is not provided by the user.
  • Callbacks are now marked noexcept; therefore, implementations must also be marked noexcept. TensorRT has never catered to exceptions thrown by callbacks, but this is now captured in the API.
  • Methods that take parameters of type void** where the array of pointers is unmodifiable are now changed to take type void*const*.
  • Dims is now a type alias for class Dims32. Code that forward-declares Dims should forward-declare class Dims32; using Dims = Dims32;.

Compatibility

  • TensorRT 8.0.0 EA has been tested with the following:
  • This TensorRT release supports CUDA:
    Note: There are two TensorRT binary builds, one for CUDA 11.0 and one for CUDA 11.3. The build for CUDA 11.3 is compatible with CUDA 11.1 and CUDA 11.2 libraries. For both builds, a CUDA driver compatible with the runtime CUDA version is required (see Table 2 here). For the CUDA 11.3 build, driver version 465 or above is suggested for best performance.
  • It is suggested that you use TensorRT with a software stack that has been tested, including cuDNN and cuBLAS versions as documented in the Features For Platforms And Software section. Other semantically compatible releases of cuDNN and cuBLAS can be used; however, other versions may have performance improvements as well as regressions. In rare cases, functional regressions might also be observed.

Limitations

  • For QAT networks, TensorRT 8.0 supports per-tensor and per-axis quantization scales for weights. For activations, only per-tensor quantization is supported. Only symmetric quantization is supported and zero-point weights may be omitted or, if zero-points are provided, all coefficients must have a value of zero.
  • Loops and DataType::kBOOL are not supported when the static TensorRT library is used. Performance improvements for transformer-based architectures such as BERT will also not be available when using the static TensorRT library.
  • When using reformat-free I/O, the extent of a tensor in a vectorized dimension might not be a multiple of the vector length. Elements in a partially occupied vector that are not within the tensor are referred to here as vector-padding. For example:
    • On GPU
      • for input tensors, the application shall set vector-padding elements to zero.
      • for output tensors, the value of vector-padding elements is undefined. In a future release, TensorRT will support setting them to zero.
    • On DLA
      • for input tensors, vector-padding elements are ignored.
      • for output tensors, vector-padding elements are unmodified.
  • When running INT8 networks on DLA using TensorRT, operations must be added to the same subgraph to reduce quantization errors across the subgraph of the network that runs on the DLA; this allows them to fuse and retain higher precision for intermediate results. Breaking apart the subgraph to inspect intermediate results, by setting the tensors as network output tensors, can result in different levels of quantization error because these optimizations are disabled.
  • If both kSPARSE_WEIGHTS and kREFIT flags are set in IBuilderConfig, the convolution layers having structured sparse kernel weights cannot be refitted with new kernel weights which do not have structured sparsity. The IRefitter::setWeights() will print an error and return false in that case.
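
To make the vector-padding limitation above concrete, here is a small pure-Python sketch that packs a channel dimension into fixed-length vectors and zero-fills the unused slots, as the rule for GPU input tensors requires. The vector length of 4 and the helper names are illustrative assumptions, not TensorRT API.

```python
def padded_extent(extent, vector_length):
    """Round extent up to the next multiple of vector_length."""
    return -(-extent // vector_length) * vector_length


def pack_channels(channels, vector_length):
    """Split per-channel values into vectors, zero-filling the vector-padding."""
    total = padded_extent(len(channels), vector_length)
    padded = list(channels) + [0] * (total - len(channels))
    return [padded[i:i + vector_length] for i in range(0, total, vector_length)]


# Six channels packed into vectors of four (a hypothetical 4-wide format):
# the second vector holds two real channels and two vector-padding zeros.
vectors = pack_channels([1, 2, 3, 4, 5, 6], vector_length=4)
print(vectors)  # [[1, 2, 3, 4], [5, 6, 0, 0]]
```

On the output side the same layout applies, but per the limitation above an application should not rely on the values in the padding slots.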

Deprecated And Removed Features

The following features are deprecated in TensorRT 8.0.0:
  • Deprecation is used to inform developers that some APIs and tools are no longer recommended for use. TensorRT has the following deprecation policy:
    • This policy comes into effect beginning with TensorRT 8.0.
    • Deprecation notices are communicated in the release notes. Deprecated API elements are marked with the TRT_DEPRECATED macro where possible.
    • TensorRT provides a 12-month migration period after the deprecation. For any APIs and tools deprecated in TensorRT 7.x, the 12-month migration period starts from the TensorRT 8.0 GA release date.
    • APIs and tools will continue to work during the migration period.
    • After the migration period ends, we reserve the right to remove the APIs and tools in a future release.
  • IRNNLayer was deprecated in TensorRT 4.0 and has been removed in TensorRT 8.0. IRNNv2Layer was deprecated in TensorRT 7.2.1 in favor of the loop API; however, it is still available for backwards compatibility. For more information about the loop API, refer to the sampleCharRNN sample with the --Iloop option as well as the Working With Loops chapter.
  • IPlugin and IPluginFactory interfaces were deprecated in TensorRT 6.0 and have been removed in TensorRT 8.0. We recommend that you write new plugins or refactor existing ones to target the IPluginV2DynamicExt and IPluginV2IOExt interfaces. For more information, refer to the Migrating Plugins From TensorRT 6.x Or 7.x To TensorRT 8.x.x section.
  • We removed samplePlugin since it was meant to demonstrate the IPluginExt interface, which is no longer supported in TensorRT 8.0.
  • The Caffe Parser and UFF Parser were deprecated in TensorRT 7.0. They are still tested and functional in TensorRT 8.0; however, we plan to remove support in the future. Ensure you migrate your workflow to use tf2onnx, keras2onnx, or TensorFlow-TensorRT (TF-TRT) for deployment.

    If you use UFF, migrate to the ONNX workflow, enabling a plugin where needed. The ONNX workflow itself does not depend on plugin enablement. To enable a plugin with ONNX, refer to Estimating Depth with ONNX Models and Custom Layers Using NVIDIA TensorRT.

    Caffe and UFF-specific topics in the Developer Guide have been moved to the Appendix section until removal in the subsequent major release.

  • Interface functions that provided a destroy function are deprecated in TensorRT 8.0. The destructors will be exposed publicly in order for the delete operator to work as expected on these classes.
  • nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION is deprecated. Networks that have QuantizeLayer and DequantizeLayer layers will be automatically processed using Q/DQ-processing, which includes explicit-precision semantics. Explicit precision is a network-optimizer constraint that prevents the optimizer from performing precision-conversions that are not dictated by the semantics of the network. For more information, refer to the Working With QAT Networks section in the TensorRT Developer Guide.
  • nvinfer1::IResizeLayer::setAlignCorners and nvinfer1::IResizeLayer::getAlignCorners are deprecated. Use nvinfer1::IResizeLayer::setCoordinateTransformation, nvinfer1::IResizeLayer::setSelectorForSinglePixel and nvinfer1::IResizeLayer::setNearestRounding instead.
  • Destructors for classes with destroy() methods were previously protected. They are now public, enabling use of smart pointers for these classes. The destroy() methods are deprecated.
  • The following C++ API functions, types, and a field, which were previously deprecated, were removed:
    Core Library:
    • DimensionType
    • Dims::Type
    • class DimsCHW
    • class DimsNCHW
    • class IOutputDimensionFormula
    • class IPlugin
    • class IPluginFactory
    • class IPluginLayer
    • class IRNNLayer
    • IBuilder::getEngineCapability()
    • IBuilder::allowGPUFallback()
    • IBuilder::buildCudaEngine()
    • IBuilder::canRunOnDLA()
    • IBuilder::createNetwork()
    • IBuilder::getAverageFindIterations()
    • IBuilder::getDebugSync()
    • IBuilder::getDefaultDeviceType()
    • IBuilder::getDeviceType()
    • IBuilder::getDLACore()
    • IBuilder::getFp16Mode()
    • IBuilder::getHalf2Mode()
    • IBuilder::getInt8Mode()
    • IBuilder::getMaxWorkspaceSize()
    • IBuilder::getMinFindIterations()
    • IBuilder::getRefittable()
    • IBuilder::getStrictTypeConstraints()
    • IBuilder::isDeviceTypeSet()
    • IBuilder::reset()
    • IBuilder::resetDeviceType()
    • IBuilder::setAverageFindIterations()
    • IBuilder::setDebugSync()
    • IBuilder::setDefaultDeviceType()
    • IBuilder::setDeviceType()
    • IBuilder::setDLACore()
    • IBuilder::setEngineCapability()
    • IBuilder::setFp16Mode()
    • IBuilder::setHalf2Mode()
    • IBuilder::setInt8Calibrator()
    • IBuilder::setInt8Mode()
    • IBuilder::setMaxWorkspaceSize()
    • IBuilder::setMinFindIterations()
    • IBuilder::setRefittable()
    • IBuilder::setStrictTypeConstraints()
    • ICudaEngine::getWorkspaceSize()
    • IMatrixMultiplyLayer::getTranspose()
    • IMatrixMultiplyLayer::setTranspose()
    • INetworkDefinition::addMatrixMultiply()
    • INetworkDefinition::addPlugin()
    • INetworkDefinition::addPluginExt()
    • INetworkDefinition::addRNN()
    • INetworkDefinition::getConvolutionOutputDimensionsFormula()
    • INetworkDefinition::getDeconvolutionOutputDimensionsFormula()
    • INetworkDefinition::getPoolingOutputDimensionsFormula()
    • INetworkDefinition::setConvolutionOutputDimensionsFormula()
    • INetworkDefinition::setDeconvolutionOutputDimensionsFormula()
    • INetworkDefinition::setPoolingOutputDimensionsFormula()
    • ITensor::getDynamicRange()
    • TensorFormat::kNHWC8
    • TensorFormat::kNCHW
    • TensorFormat::kNC2HW2
    Plugins: The following plugin classes were removed:
    • class INvPlugin
    • createLReLUPlugin()
    • createClipPlugin()
    • PluginType
    • struct SoftmaxTree

    Plugin interface methods: For plugins based on IPluginV2DynamicExt and IPluginV2IOExt, certain methods with legacy function signatures (derived from IPluginV2 and IPluginV2Ext base classes) which were deprecated and marked for removal in TensorRT 8.0 will no longer be available. Plugins using these interface methods must stop using them or implement the versions with updated signatures, as applicable.

    Unsupported plugin methods removed in TensorRT 8.0:
    • IPluginV2DynamicExt::canBroadcastInputAcrossBatch()
    • IPluginV2DynamicExt::isOutputBroadcastAcrossBatch()
    • IPluginV2DynamicExt::getTensorRTVersion()
    • IPluginV2IOExt::configureWithFormat()
    • IPluginV2IOExt::getTensorRTVersion()
    Use updated versions for supported plugin methods:
    • IPluginV2DynamicExt::configurePlugin()
    • IPluginV2DynamicExt::enqueue()
    • IPluginV2DynamicExt::getOutputDimensions()
    • IPluginV2DynamicExt::getWorkspaceSize()
    • IPluginV2IOExt::configurePlugin()
    Use newer methods for the following:
    • IPluginV2DynamicExt::supportsFormat() has been removed; use IPluginV2DynamicExt::supportsFormatCombination() instead.
    • IPluginV2IOExt::supportsFormat() has been removed; use IPluginV2IOExt::supportsFormatCombination() instead.
    Caffe Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • setPluginFactory()
    • setPluginFactoryExt()
    UFF Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • setPluginFactory()
    • setPluginFactoryExt()
  • The following Python API functions, which were previously deprecated, were removed:
    Core library:
    • class DimsCHW
    • class DimsNCHW
    • class IPlugin
    • class IPluginFactory
    • class IPluginLayer
    • class IRNNLayer
    • Builder.build_cuda_engine()
    • Builder.average_find_iterations
    • Builder.debug_sync
    • Builder.fp16_mode
    • Builder.int8_mode
    • Builder.max_workspace_size
    • Builder.min_find_iterations
    • Builder.refittable
    • Builder.strict_type_constraints
    • ICudaEngine.max_workspace_size
    • IMatrixMultiplyLayer.transpose0
    • IMatrixMultiplyLayer.transpose1
    • INetworkDefinition.add_matrix_multiply_deprecated()
    • INetworkDefinition.add_plugin()
    • INetworkDefinition.add_plugin_ext()
    • INetworkDefinition.add_rnn()
    • INetworkDefinition.convolution_output_dimensions_formula
    • INetworkDefinition.deconvolution_output_dimensions_formula
    • INetworkDefinition.pooling_output_dimensions_formula
    • ITensor.get_dynamic_range()
    • Dims.get_type()
    • TensorFormat.HWC8
    • TensorFormat.NCHW
    • TensorFormat.NCHW2
    Caffe Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • setPluginFactory()
    • setPluginFactoryExt()
    UFF Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • setPluginFactory()
    • setPluginFactoryExt()
    Plugins:
    • class INvPlugin
    • createLReLUPlugin()
    • createClipPlugin()
    • PluginType
    • struct SoftmaxTree
  • The following Python API functions were removed:
    Caffe Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • CaffeParser.plugin_factory
    • CaffeParser.plugin_factory_ext
    UFF Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • UffParser.plugin_factory
    • UffParser.plugin_factory_ext

Fixed Issues

  • Improved build times for convolution layers with dynamic shapes and large range of leading dimensions.
  • TensorRT 8.0 no longer requires libcublas.so.* to be present on your system when running an application which was linked with the TensorRT static library. The TensorRT static library now requires cuBLAS and other dependencies to be linked at link time and will no longer open these libraries using dlopen().
  • TensorRT 8.0 no longer requires an extra Identity layer between the ElementWise and the Constant whose rank is > 4. For TensorRT 7.x versions, in cases like Convolution and FullyConnected with bias, where ONNX decomposes the bias into an ElementWise layer, there was a fusion that didn’t support per-element scale. We previously inserted an Identity layer to work around this.
  • There was a known performance regression compared to TensorRT 7.1 for Convolution layers with kernel size greater than 5x5. For example, it could lead to a performance regression of up to 35% for the VGG16 UFF model compared to TensorRT 7.1. This issue has been fixed in this release.
  • When running networks such as Cortana, LSTM Peephole, MLP, and Faster RCNN, there was a 5% to 16% performance regression on GA102 devices and a 7% to 36% performance regression on GA104 devices. This issue has been fixed in this release. (not applicable for Jetson platforms)
  • Some RNN networks such as Cortana with FP32 precision and batch size of 8 or higher had a 20% performance loss with CUDA 11.0 or higher compared to CUDA 10.2. This issue has been fixed in this release.

Announcements

  • TensorRT 8.0 will be the last TensorRT release that will provide support for Ubuntu 16.04. This also means TensorRT 8.0 will be the last TensorRT release that will support Python 3.5.
  • Python samples use a unified data downloading workflow. Each sample has a YAML file (download.yml) describing the data files, if any, that must be downloaded via a link before running the sample. The download tool parses the YAML file and downloads the data files. All other sample code assumes that the data has been downloaded before the code is invoked. An error will be raised if the data is not correctly downloaded. Refer to the Python sample documentation for more information.

Known Issues

  • The diagram in IRNNv2Layer is incorrect. This will be fixed in a future release.
  • There is a known issue that graph capture may fail in some cases for IExecutionContext::enqueue() and IExecutionContext::enqueueV2(). For more information, refer to the documentation for IExecutionContext::enqueueV2(), including how to work around this issue.
  • Some fusions are not enabled when the TensorRT static library is used. This means there is a performance loss of around 10% for networks like BERT and YOLO3 when linking with the static library compared to the dynamic library. The performance loss depends on precision used and batch size and it can be up to 60% in some cases.
  • The UFF parser generates unused IConstantLayer objects that are visible via the method INetworkDefinition::getLayer but are optimized away by TensorRT, so any attempt to refit those weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking: layer->getOutput(0)->isExecutionTensor().
  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.
  • There is a known performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on Maxwell and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation on Nano.
  • There is an up to 15% performance regression compared to TensorRT 7.2.3 for QuartzNet variants on Volta GPUs.
  • There is an up to 150% performance regression compared to TensorRT 7.2.3 for 3D U-Net variants on NVIDIA Ampere GPUs if the workspace size is limited to 1GB. Enlarging the workspace size (for example, to 2GB) can work around this issue.
  • There is a known issue that TensorRT selects kLINEAR format when the user uses reformat-free I/O with vectorized formats and with input/output tensors which have only 3 dimensions. The workaround is to add an additional dimension to the tensors with size 1 to make them 4 dimensional tensors.
  • CuTensor based algorithms on TensorRT 8.0 EA are known to have significant performance regressions due to an issue with the CUDA 11.3 compiler (5x-10x slower than CUDA 11.0 builds). This is due to a compiler regression and the performance should be recovered with a future CUDA release.
  • When running TensorRT 8.0.0 with cuDNN 8.2.0, there is a known performance regression for the deconvolution layer compared to running with previous cuDNN releases. For example, some deconvolution layers can have up to 7x performance regression on Turing GPUs compared to running with cuDNN 8.0.4. This will be fixed in a future cuDNN release.
  • There is a known false alarm reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommended way for suppressing the false alarm is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool.
    {
       Ignore the dlopen false alarm.
       Memcheck:Leak
       ...
       fun:_dl_open
       ...
    }
    
  • There is an up to 8% performance regression compared to TensorRT 7.2.3 for DenseNet variants on Volta GPUs.
  • There is an up to 24% performance regression compared to TensorRT 7.2.3 for networks containing Slice layers on Turing GPUs.
  • While using the TensorRT static library, users are still required to have the cuDNN/cuBLAS dynamic libraries installed at runtime. This issue will be resolved in the GA release so that cuDNN/cuBLAS static libraries will always be used instead.
  • An issue was discovered while compiling the TensorRT samples using the TensorRT static libraries with a GCC version older than 5.x. When using RHEL/CentOS 7.x, you may observe a crash with the error message munmap_chunk(): invalid pointer if the patch below is not applied. More details regarding this issue with a workaround for your own application can be found in the TensorRT Sample Support Guide.
    --- a/samples/Makefile.config
    +++ b/samples/Makefile.config
    @@ -331,13 +331,13 @@ $(OUTDIR)/$(OUTNAME_DEBUG) : $(DOBJS) $(CUDOBJS)
     else
     $(OUTDIR)/$(OUTNAME_RELEASE) : $(OBJS) $(CUOBJS)
     	$(ECHO) Linking: $@
    -	$(AT)$(CC) -o $@ $^ $(LFLAGS) -Wl,--start-group $(LIBS) -Wl,--end-group
    +	$(AT)$(CC) -o $@ $(LFLAGS) -Wl,--start-group $(LIBS) $^ -Wl,--end-group
     	# Copy every EXTRA_FILE of this sample to bin dir
     	$(foreach EXTRA_FILE,$(EXTRA_FILES), cp -f $(EXTRA_FILE) $(OUTDIR)/$(EXTRA_FILE); )
     
     $(OUTDIR)/$(OUTNAME_DEBUG) : $(DOBJS) $(CUDOBJS)
     	$(ECHO) Linking: $@
    -	$(AT)$(CC) -o $@ $^ $(LFLAGSD) -Wl,--start-group $(DLIBS) -Wl,--end-group
    +	$(AT)$(CC) -o $@ $(LFLAGSD) -Wl,--start-group $(DLIBS) $^ -Wl,--end-group
     endif
     
     $(OBJDIR)/%.o: %.cpp
    
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS.
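
For the known issue above about reformat-free I/O with 3-dimensional tensors, the workaround amounts to a shape change that leaves the element count untouched. A minimal sketch, assuming the extra dimension is appended at the end (the note does not say where to add it) and using a hypothetical helper name and example shape:

```python
from math import prod


def with_unit_dim(shape):
    """Return a 4-D shape for a 3-D one by appending a dimension of size 1."""
    if len(shape) != 3:
        raise ValueError("expected a 3-D shape, got %d dimensions" % len(shape))
    return tuple(shape) + (1,)


shape_3d = (8, 32, 100)  # hypothetical (batch, channels, length)
shape_4d = with_unit_dim(shape_3d)

# The added dimension has size 1, so the number of elements is unchanged.
assert prod(shape_4d) == prod(shape_3d)
print(shape_4d)  # (8, 32, 100, 1)
```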