Release Notes
These Release Notes describe the key features, software enhancements and improvements, and known issues for the TensorRT for RTX (TensorRT-RTX) release.
TensorRT-RTX 1.1
These are the NVIDIA TensorRT-RTX 1.1 Release Notes.
Key Features and Enhancements
This TensorRT-RTX release includes the following key features and enhancements when compared to NVIDIA TensorRT-RTX 1.0.
Added the IRuntime::getEngineValidity() API to programmatically check whether a TensorRT-RTX engine file is valid on the current system or needs to be rebuilt due to incompatibilities in the software version, compute capability, and so on. Because this API only inspects the file header, it does not require loading the entire engine file into memory, which makes the check more efficient (see the usage sketch after this list).
Compilation time has been greatly improved, particularly for models with many memory-bound kernels. On average, a 1.5x improvement is observed across a variety of model architectures.
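The following is a minimal sketch of how an application might gate engine deserialization on this check. The header name, the exact signature of getEngineValidity(), and the EngineValidity::kVALID comparison value are assumptions for illustration only; verify them against the TensorRT-RTX API reference.

```cpp
#include <NvInferRuntime.h>  // assumed TensorRT-RTX runtime header

#include <fstream>
#include <vector>

// Read a serialized engine file from disk. Note that getEngineValidity() only
// needs the file header, so an application that only wants the validity check
// could read just the start of the file instead of the whole blob.
static std::vector<char> readFile(const char* path) {
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    std::vector<char> blob(static_cast<size_t>(file.tellg()));
    file.seekg(0);
    file.read(blob.data(), blob.size());
    return blob;
}

// Returns the deserialized engine, or nullptr if the engine file needs to be
// rebuilt for the current system. The caller owns the returned engine.
nvinfer1::ICudaEngine* loadEngineIfValid(nvinfer1::IRuntime& runtime, const char* path) {
    std::vector<char> blob = readFile(path);

    // Assumption: getEngineValidity() takes the serialized bytes and returns a
    // status comparable against a "valid" value; the names used here are illustrative.
    if (runtime.getEngineValidity(blob.data(), blob.size()) != nvinfer1::EngineValidity::kVALID) {
        return nullptr;  // Rebuild the engine (for example, with the AOT compiler).
    }
    return runtime.deserializeCudaEngine(blob.data(), blob.size());
}
```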
Compatibility
This TensorRT-RTX release supports NVIDIA CUDA 12.9.
TensorRT-RTX supports both Windows and Linux platforms. The Linux build is expected to work on the x86-64 architecture with Rocky Linux 8.9, Rocky Linux 9.3, Ubuntu 20.04, Ubuntu 22.04, Ubuntu 24.04, and SLES 15. However, only the platforms listed in the Support Matrix are officially supported in this release.
Limitations
Using a timing cache via ITimingCache and related APIs forces the ahead-of-time compilation step to query your system for a GPU (that is, it prevents use of CPU-only AOT compilation). This was also true in version 1.0. The timing cache has no effect on the built engine in TensorRT-RTX. We intend to deprecate the timing cache APIs in the next release and then remove them in a future update; they remain in place for now to avoid breaking source or binary compatibility. Applications should begin removing usage of timing cache APIs in preparation for their removal.
If the cache grows too large (larger than 100 MB), serializing it to and deserializing it from disk may add noticeable overhead. If this negatively affects performance, delete the cache file and recreate it (a small cleanup sketch appears at the end of this section).
TensorRT-RTX engines are not forward-compatible with other versions of the TensorRT-RTX runtime. Ensure that any TensorRT-RTX engines you produce are run using the runtime from the same version of TensorRT-RTX which was used to generate the engine.
While TensorRT-RTX supports Turing (CUDA compute capability 7.5), building with the default compute capability settings produces an engine that supports Ampere and later GPUs, therefore excluding Turing. For the best performance, it is recommended to build a separate engine specifically for Turing. Creating a single engine that supports both Turing and later GPUs leads to less performant inference on Ampere and later GPUs due to technical limitations of the engine format.
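As a rough sketch of the cleanup suggested in the timing-cache limitation above, an application can check the size of its serialized cache file before reuse and delete it once it exceeds roughly 100 MB. The file path used here is a placeholder.

```cpp
#include <cstdint>
#include <filesystem>

// Placeholder path to the application's serialized timing cache file.
const std::filesystem::path kTimingCachePath = "timing.cache";

// Remove the on-disk timing cache once it exceeds ~100 MB so that a fresh,
// smaller cache is created on the next build (threshold taken from the guidance above).
void pruneTimingCacheIfOversized() {
    constexpr std::uintmax_t kMaxBytes = 100ull * 1024ull * 1024ull;
    std::error_code ec;
    if (std::filesystem::exists(kTimingCachePath, ec) &&
        std::filesystem::file_size(kTimingCachePath, ec) > kMaxBytes) {
        std::filesystem::remove(kTimingCachePath, ec);
    }
}
```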
Deprecated and Removed Features
The following features have been deprecated or removed in TensorRT for RTX 1.1.
ITimingCache APIs will be deprecated in the next version of TensorRT-RTX. Refer to the Limitations section for more information.
Fixed Issues
Parsing custom operators from ONNX would cause the software to crash. Custom operators remain unsupported, but the parser now reports an error instead of crashing.
Builder flags kREFIT or kREFIT_IDENTICAL would cause an engine build failure when normalization layers existed in the network definition.
When building an engine with compute capability SM75, MatrixMultiply layers with a batch size greater than 1 would fail during execution.
A model containing hardware-specific data types would fail to build at AOT time if those data types were not supported on the target compute capabilities specified through the TensorRT-RTX API.
Specific dynamic shape cases (with very large filter sizes and custom paddings) would cause build failure at JIT time.
Dynamic shape networks with if/else conditional blocks that branch based on tensor shapes resulted in segmentation faults in some cases.
Repeated serialization and deserialization of engines caused segmentation faults.
Specifying the kCURRENT compute capability on some platforms, which should lead to higher performance, lagged behind the default compute capability.
Creating an execution context would create and immediately destroy many threads.
Known Issues
Functional
NonMaxSuppression, NonZero, and Multinomial layers are not supported.
Only the WDDM driver for Windows is supported. The TCC driver for Windows (refer to Tesla Compute Cluster (TCC)) is unsupported and may fail with the following error.
[E] Error[1]: [defaultAllocator.cpp::nvinfer1::internal::DefaultAllocator::allocateAsync::48] Error Code 1: Cuda Runtime (operation not supported)
For instructions on changing the driver mode, refer to the Nsight Visual Studio Edition documentation.
When using TensorRT-RTX with the PyCUDA library in Python, use import pycuda.autoprimaryctx instead of import pycuda.autoinit to avoid device conflicts.
Depthwise convolutions and deconvolutions for BF16 precision are not supported.
Convolutions and deconvolutions that have both non-unit strides and non-unit dilations are not supported for all precisions. Convolutions and deconvolutions with only non-unit strides, or with only non-unit dilations, are supported.
Performance
Use of the CPU-only ahead-of-time (AOT) compilation feature can lead to reduced performance for some models, particularly those with multi-head attention (MHA), due to CPU-only AOT's use of conservative shared memory limits. Affected applications will achieve the best performance if they instead perform AOT compilation on-device, targeted to the specific end-user machine. This can be done with the --useGPU flag for the tensorrt_rtx binary, or, if using the APIs, by setting the compute capabilities to contain only kCURRENT using IBuilderConfig::setComputeCapability() (see the sketch at the end of this section). You can measure performance with both approaches to determine which is best for your application. We plan to resolve this performance discrepancy in a future release.
We have prioritized optimizing performance for 16-bit floating-point types, and such models will frequently achieve throughput with TensorRT-RTX that is very close to that achieved with TensorRT. Models that heavily use 32-bit floating-point types will still see improvement, but their performance will tend not to be as strong as with TensorRT. Expect performance across many models and data types to improve in future versions of TensorRT-RTX.
Background kernel compilations, triggered as part of the dynamic-shapes specialization strategy, are opportunistic and are not guaranteed to complete before network execution finishes. If your dynamic-shapes workload contains a fixed set of shapes, consider using the eager specialization strategy along with the runtime cache to load and store kernels quickly for best performance.
When running in CiG mode, some models show significantly reduced performance compared to non-CiG mode because of suboptimal kernels that are compatible with the shared memory limitations.
Some models show increased VRAM usage compared to running inference for the same model in TensorRT 10.11.
Convolutions and deconvolutions with large filter sizes may have degraded performance.
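For reference, the sketch below shows the API path mentioned in the CPU-only AOT note above: restricting ahead-of-time compilation to the GPU in the current system by selecting only kCURRENT. The exact signature of IBuilderConfig::setComputeCapability() and the companion setNbComputeCapabilities() call are assumptions here; confirm them against the TensorRT-RTX API reference. Alternatively, the tensorrt_rtx command-line binary achieves the same effect with the --useGPU flag.

```cpp
#include <NvInfer.h>  // assumed TensorRT-RTX builder header

// Target only the GPU present on the end-user machine during AOT compilation,
// as recommended above for best performance on that machine.
// Assumption: setComputeCapability() takes the capability enum plus an index,
// and setNbComputeCapabilities() declares how many capabilities are provided.
void targetCurrentGpuOnly(nvinfer1::IBuilderConfig& config) {
    config.setNbComputeCapabilities(1);
    config.setComputeCapability(nvinfer1::ComputeCapability::kCURRENT, 0);
}
```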