TensorRT-RTX 1.5 Release Notes#
These Release Notes apply to x86 Linux and Windows users. This release includes several fixes for issues in previous TensorRT-RTX releases, along with additional changes.
Key Features and Enhancements#
Operator Support
This release adds support for additional ONNX operators:
TensorRT-RTX now supports the RoiAlign ONNX operator.
Platform Support
This release expands platform availability:
An experimental build for DGX Spark and Linux SBSA is now available.
GPU Latency Optimizations
This release improves GPU latency with the following optimizations:
GEMV kernel performance has been improved for models compiled with dynamic input shapes.
Convolution performance has been improved on the sm_121 architecture (NVIDIA GB10 / CUDA compute capability 12.1).
CPU overhead between kernel launches has been reduced.
Kernel fusion coverage has been expanded for models compiled with dynamic input shapes.
Just-in-time kernel generation has been improved, expanding coverage for additional convolution variants and runtime fusion patterns.
Breaking ABI and API Changes#
Warning
TensorRT-RTX 1.5 is not ABI-compatible with TensorRT-RTX 1.4. To upgrade, you must update your headers and rebuild your application against the 1.5 SDK.
The following changes apply to this release. ABI Changes affect the binary interface and require an application rebuild against the 1.5 SDK; API Changes break source-level compatibility and require code modifications before rebuilding.
ABI Changes#
These changes affect the binary interface. Source code typically compiles unchanged after picking up the new headers, but applications must be rebuilt and re-linked against the 1.5 SDK.
Removed Destructor and Constructor Symbols
The following types no longer export destructor or constructor symbols:
nvinfer1::ILogger
nvinfer1::ILoggerFinder
nvinfer1::IRuntimeCache
nvinfer1::ITimingCache
nvinfer1::IProfiler
API Changes#
These changes affect source-level compatibility. Applications using the affected APIs must update their source code before rebuilding against the 1.5 SDK.
Return Type Changes
nvinfer1::IBuilderConfig::setMaxAuxStreams(int32_t): return type changed from void to bool.
Removed Previously Deprecated APIs
The following APIs were previously deprecated and have been removed in this release:
nvinfer1::NetworkDefinitionCreationFlag: kEXPLICIT_BATCH
nvinfer1::IRuntime: deserializeCudaEngine(IStreamReader&)
nvinfer1::IRefitter: setDynamicRange(char const*, float, float); getDynamicRangeMin(char const*); getDynamicRangeMax(char const*); getTensorsWithDynamicRange(int32_t, char const**)
nvinfer1::IOptimizationProfile: setShapeValues(char const*, OptProfileSelector, int32_t const*, int32_t); getShapeValues(char const*, OptProfileSelector)
nvinfer1::ICudaEngine: getProfileTensorValues(char const*, int32_t, OptProfileSelector); getDeviceMemorySizeForProfile(int32_t); setWeightStreamingBudget(int64_t); getWeightStreamingBudget(); getMinimumWeightStreamingBudget(); hasImplicitBatchDimension()
nvinfer1::IExecutionContext: allInputShapesSpecified()
nvinfer1::INetworkDefinition: addFill(Dims const&, FillOperation)
nvinfer1::IBuilder: platformHasFastFp16(); platformHasFastInt8(); platformHasTf32()
nvinfer1::plugin: CodeTypeSSD (class removed)
nvinfer1::IPluginRegistry: registerCreator(IPluginCreator&, AsciiChar const* const); getPluginCreatorList(int32_t* const); getPluginCreator(AsciiChar const* const, AsciiChar const* const, AsciiChar const* const); deregisterCreator(IPluginCreator const&)
nvinfer1::IPluginV2Ext: isOutputBroadcastAcrossBatch(int32_t, bool const*, int32_t); canBroadcastInputAcrossBatch(int32_t)
nvonnxparser::IParser: supportsModel(void const*, size_t, SubGraphCollection_t&, char const*); parseWithWeightDescriptors(void const*, size_t)
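One way to find call sites that may need attention before rebuilding is to scan the application source for the removed identifiers. The following is a minimal sketch of such a scan, not an official migration tool. The identifier subset is copied from the list above; the names `REMOVED_IDENTIFIERS` and `find_removed_api_uses` are hypothetical. Because it matches on names only, it can also flag overloads or unrelated methods that still exist, so treat each hit as a candidate for manual review.

```python
import re

# Subset of identifiers removed in 1.5, copied from the list above.
REMOVED_IDENTIFIERS = (
    "kEXPLICIT_BATCH",
    "hasImplicitBatchDimension",
    "allInputShapesSpecified",
    "platformHasFastFp16",
    "platformHasFastInt8",
    "platformHasTf32",
    "getDeviceMemorySizeForProfile",
    "setDynamicRange",
    "getDynamicRangeMin",
    "getDynamicRangeMax",
)

# Match whole identifiers only, so a longer name that merely contains
# one of the identifiers is not reported.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in REMOVED_IDENTIFIERS) + r")\b"
)


def find_removed_api_uses(source_text):
    """Return (line_number, identifier) pairs for every name-based hit."""
    hits = []
    for line_number, line in enumerate(source_text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((line_number, match.group(1)))
    return hits
```

Running the function over each source file and printing the hits gives a review checklist; it does not rewrite any code.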
Compatibility#
For comprehensive platform compatibility, hardware requirements, and feature availability information, refer to the TensorRT-RTX Support Matrix. The support matrix provides detailed information about supported operating systems, CUDA versions, GPU architectures, precision modes, compiler requirements, and ONNX operator support for TensorRT-RTX 1.5.
Limitations#
Warning
Timing cache deprecated: Using ITimingCache forces a GPU query during ahead-of-time compilation and has no effect on the built engine. The APIs were deprecated in 1.2 and will be removed in a future update.
Engines are not forward-compatible: TensorRT-RTX engines must be rebuilt when upgrading the TensorRT-RTX runtime library.
Using a timing cache through ITimingCache and related APIs forces the ahead-of-time compilation step to query your system for a GPU (that is, prevents use of CPU-only ahead-of-time compilation). This has been true since the initial release (version 1.0). The timing cache has no effect on the built engine in TensorRT-RTX. The timing cache APIs were deprecated in 1.2 and will be removed in a future update. They remain in place to avoid breaking source or binary compatibility. Applications should stop using timing cache APIs in preparation for their removal.
If the cache size grows too large (larger than 100 MB), it might require more overhead to serialize and deserialize from disk. If it negatively affects performance, delete the cache file and recreate one.
TensorRT-RTX engines are not forward-compatible with other versions of the TensorRT-RTX runtime. Ensure that any TensorRT-RTX engines you produce are run using the runtime from the same version of TensorRT-RTX that was used to generate the engine.
While TensorRT-RTX supports Turing (CUDA compute capability 7.5), building with the default compute capability settings produces an engine that supports Ampere and later GPUs only, and therefore excludes Turing. It is recommended to build a separate engine specifically for Turing to achieve the best performance. A single engine that supports Turing and later GPUs will run less efficiently on Ampere and later GPUs due to technical limitations of the engine format.
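The last two limitations suggest keeping engine files separated both by TensorRT-RTX version and by GPU family (Turing versus Ampere and later). The sketch below shows one possible file-naming scheme for an application-managed engine directory. All function names, the file-name layout, and the family labels are hypothetical; treating every compute capability other than 7.5 or 8.0 and above as unsupported is an assumption that should be checked against the Support Matrix.

```python
from pathlib import Path


def gpu_family(cc_major, cc_minor):
    """Map a CUDA compute capability to the engine variant to load."""
    if (cc_major, cc_minor) == (7, 5):
        return "turing"          # separate engine recommended for Turing
    if cc_major >= 8:
        return "ampere_plus"     # default build targets Ampere and later
    # Assumption: anything else is not handled by this sketch.
    raise ValueError(f"unsupported compute capability {cc_major}.{cc_minor}")


def engine_path(engine_dir, model_name, rtx_version, cc_major, cc_minor):
    """Build a path that encodes both the SDK version and the GPU family."""
    family = gpu_family(cc_major, cc_minor)
    return Path(engine_dir) / f"{model_name}.rtx-{rtx_version}.{family}.engine"


def find_engine(engine_dir, model_name, rtx_version, cc_major, cc_minor):
    """Return the matching engine path, or None if it must be built."""
    path = engine_path(engine_dir, model_name, rtx_version, cc_major, cc_minor)
    return path if path.is_file() else None
```

With this layout, an engine built with one TensorRT-RTX version is never picked up by a different runtime version: the lookup returns `None` and the application rebuilds instead.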
Deprecated and Removed Features#
Several previously deprecated APIs have been removed in this release. Refer to API Changes for the complete symbol-level list.
Fixed Issues#
Fixed an internal compiler failure that prevented YOLO ONNX models (for example, YOLOv26m and YOLOv26m-seg) from building on Turing (sm_75) GPUs. (Reported in GitHub issue #33.)
Fixed an internal builder error reporting target_sm '101' does not have arch kind assigned that could occur when targeting newer compute capabilities.
Fixed an execution failure for models that use a Convolution layer immediately followed by a GeLU operation. The issue affected all GPU architectures.
Fixed a MyelinCheckException thrown by ICudaEngine::createExecutionContext during JIT initialization for certain FP16 dynamic-shape ONNX models (for example, Stable Diffusion XL UNet and DaVinci Resolve SpeedWarp). The engine built and deserialized successfully, but execution-context creation failed; this affected multiple GPU architectures and was not hardware-specific.
Fixed an accuracy regression affecting some models built with dynamic input shapes, caused by eng8 matmul kernels not reloading correctly from the kernel cache. These kernels are now excluded as candidates when dynamic shapes are in use.
Enabled depthwise convolutions and deconvolutions in BF16 precision, removing a known limitation from TensorRT-RTX 1.4.
Enabled 3D deconvolutions with groups and padding, removing a known limitation from TensorRT-RTX 1.4.
Fixed a MyelinCheckException thrown by ICudaEngine::createExecutionContext during JIT initialization for certain models containing static FP8 convolutions.
Fixed a performance regression where specialized (tuned) cuDNN kernels were not selected from the kernel cache after an IExecutionContext was destroyed and recreated against the same engine. This affected applications that hold a persistent engine while creating short-lived execution contexts (for example, per-request inference servers).
Fixed a performance regression affecting certain 1D convolution layers with large kernel sizes.
Fixed an issue on Windows where runtime compilation could occasionally fail when the system temporary folder path exceeded the MAX_PATH limit (260 characters).
Fixed multi-threaded compilation crashes that could occur when building models containing convolutions.
Known Issues#
Functional
NonMaxSuppression, NonZero, and Multinomial layers are not supported.
Only the WDDM driver for Windows is supported. The TCC driver for Windows (refer to Tesla Compute Cluster (TCC)) is unsupported and may fail with the following error:
[E] Error[1]: [defaultAllocator.cpp::nvinfer1::internal::DefaultAllocator::allocateAsync::48] Error Code 1: Cuda Runtime (operation not supported)
For instructions on changing the driver mode, refer to the Nsight Visual Studio Edition documentation.
When using TensorRT-RTX with the PyCUDA library in Python, use import pycuda.autoprimaryctx instead of import pycuda.autoinit to avoid device conflicts.
Convolutions and deconvolutions that combine non-unit strides with non-unit dilations are not supported in any precision. Convolutions and deconvolutions that use only non-unit strides, or only non-unit dilations, are supported.
Build may fail when a MatrixMultiply, Convolution, or Deconvolution op uses FP32 inputs and outputs and is followed by a cast to a lower-precision type (for example, FP16 or FP8). Affects all GPU architectures; a fix is expected in a future TensorRT-RTX update.
On Windows, the following symbols in tensorrt_rtx_1_5.dll should not be used and will be removed in the future:
?disableInternalBuildFlags@nvinfer1@@YAXAEAVINetworkDefinition@1@_K@Z
?enableInternalBuildFlags@nvinfer1@@YAXAEAVINetworkDefinition@1@_K@Z
?getInternalBuildFlags@nvinfer1@@YA_KAEBVINetworkDefinition@1@@Z
?setDebugOutput@nvinfer1@@YAXAEAVIExecutionContext@1@PEAVIDebugOutput@1@@Z
?setInternalBuildFlags@nvinfer1@@YAXAEAVINetworkDefinition@1@_K@Z
nvinfer1DisableInternalBuildFlags
nvinfer1EnableInternalBuildFlags
nvinfer1GetInternalBuildFlags
nvinfer1SetInternalBuildFlags
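Two of the functional restrictions above (the unsupported layers, and convolutions that combine non-unit strides with non-unit dilations) can be checked before a build is attempted. The following is a minimal pre-flight sketch under stated assumptions: the caller has already extracted operator type strings and per-layer stride and dilation values from the model (for example, from an ONNX graph), and the function names are hypothetical rather than part of TensorRT-RTX.

```python
# Layer types listed as unsupported in the known issues above.
UNSUPPORTED_OPS = {"NonMaxSuppression", "NonZero", "Multinomial"}


def unsupported_ops(op_types):
    """Return the sorted, de-duplicated unsupported op types found."""
    return sorted(set(op_types) & UNSUPPORTED_OPS)


def conv_is_supported(strides, dilations):
    """Return False when a (de)convolution combines non-unit strides
    with non-unit dilations; either one alone is supported."""
    has_stride = any(s != 1 for s in strides)
    has_dilation = any(d != 1 for d in dilations)
    return not (has_stride and has_dilation)
```

For example, `conv_is_supported([2, 2], [1, 1])` passes because only the strides are non-unit, while `conv_is_supported([2, 2], [2, 2])` is rejected.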
Performance
Use of the CPU-only ahead-of-time (AOT) feature can lead to reduced performance for some models. Affected applications achieve the best performance if they instead perform ahead-of-time compilation on-device, targeted to the specific end-user machine. This can be done with the --useGPU flag for the tensorrt_rtx binary, or when using the APIs by setting the compute capabilities to contain only kCURRENT using IBuilderConfig::setComputeCapability(). Measure performance with both approaches to determine which is best for your application. This performance discrepancy is targeted for resolution in a future release.
Performance optimization in this release prioritizes 16-bit floating point types; such models frequently achieve throughput using TensorRT-RTX that is very close to that achieved with TensorRT. Models that heavily use 32-bit floating point types still see improvement, but performance tends to be lower than that achieved using TensorRT. Performance across additional models and data types is expected to improve in future versions of TensorRT-RTX.
Convolutions and deconvolutions with large filter sizes may have degraded performance. Performance improvements for these cases are planned for a future release.
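The recommendation above to measure both the CPU-only AOT engine and the on-device AOT engine can be followed with a small timing harness. The sketch below is generic and does not call TensorRT-RTX itself: each candidate is a zero-argument callable supplied by the application, assumed to run one inference and to block until the GPU work has finished. The function names and default iteration counts are hypothetical.

```python
import statistics
import time


def median_latency_ms(run_once, warmup=3, iterations=20):
    """Time a zero-argument callable and return its median latency in ms."""
    for _ in range(warmup):
        run_once()  # warm-up runs are not timed
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def faster_variant(candidates, warmup=3, iterations=20):
    """candidates maps a label to a callable; return (best_label, timings)."""
    timings = {
        label: median_latency_ms(fn, warmup=warmup, iterations=iterations)
        for label, fn in candidates.items()
    }
    return min(timings, key=timings.get), timings
```

A call such as `faster_variant({"cpu_only_aot": run_a, "on_device_aot": run_b})` returns the label with the lower median latency together with both measurements, where `run_a` and `run_b` are the application's own inference wrappers.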