Advanced#
Working with Quantized Types#
TensorRT-RTX supports the reduced-precision data types INT4, INT8, FP4 and FP8 for improved performance at the cost of accuracy. Note that FP8 is only supported for matrix multiplications (GEMM) on Ada and later architectures (compute capability 8.9 or above), while FP4 is only supported for matrix multiplications on Blackwell or later architectures (compute capability 10.0 or above).
Reduced precision (or quantization) must be explicitly selected by the user for each layer that should use it. This is done by inserting IQuantizeLayer and IDequantizeLayer (Q/DQ) nodes in the graph. Quantization can be performed during the training process (Quantization-Aware Training, QAT) or in a separate post-processing step (Post-Training Quantization, PTQ).
Several popular deep learning frameworks allow model quantization using either QAT or PTQ, such as:
PyTorch (Quantization — PyTorch 2.7 documentation)
TensorFlow (Quantization aware training | TensorFlow Model Optimization and Post-training quantization | TensorFlow Model Optimization)
ONNX Runtime (Quantize ONNX models | onnxruntime)
Information about which layers to quantize can be encoded in an ONNX model file via the ONNX QuantizeLinear and DequantizeLinear operators, and imported via the TensorRT-RTX ONNX parser.
Finally, the NVIDIA TensorRT Model Optimizer (TensorRT-Model-Optimizer/examples/windows at main · NVIDIA/TensorRT-Model-Optimizer · GitHub) is an open-source tool developed specifically to help TensorRT and TensorRT-RTX users quantize pretrained models for higher performance.
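For illustration, Q/DQ nodes can also be inserted programmatically when constructing the network with the builder API. The following is a minimal C++ sketch, not taken from the TensorRT-RTX samples: it assumes that TensorRT-RTX exposes the TensorRT-style INetworkDefinition::addConstant(), addQuantize(), and addDequantize() methods, that network and weightTensor are defined by the surrounding application, and that a per-tensor scale has already been obtained from calibration.
// Hedged sketch: wrap a weight tensor in Q/DQ nodes so the following
// matrix multiplication can run in INT8. Assumes TensorRT-style
// addConstant()/addQuantize()/addDequantize() APIs; `network` and
// `weightTensor` are placeholders from the surrounding application.
float scaleValue = 0.0125f; // per-tensor scale from calibration; must stay alive until the engine is built
Weights scaleWeights{DataType::kFLOAT, &scaleValue, 1};
auto* scale = network->addConstant(Dims{0, {}}, scaleWeights)->getOutput(0); // rank-0 (scalar) scale for per-tensor quantization

auto* q = network->addQuantize(*weightTensor, *scale);       // FP32 -> INT8
auto* dq = network->addDequantize(*q->getOutput(0), *scale); // INT8 -> FP32
// Feed dq->getOutput(0) into the matrix multiplication layer; the Q/DQ pair
// signals that this operation should be executed in INT8.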
Weight Streaming#
TensorRT-RTX offers a weight streaming feature to allow users to run inference on models that are too large to fit in device memory. Instead, parts of the weights are stored in host memory and streamed to the device as needed. To enable this feature, set the kWEIGHT_STREAMING flag in the builder configuration:
auto config = builder->createBuilderConfig();
config->setFlag(BuilderFlag::kWEIGHT_STREAMING);
auto hostMem = builder->buildSerializedNetwork(*network, *config);
config = builder.create_builder_config()
config.set_flag(tensorrt_rtx.BuilderFlag.WEIGHT_STREAMING)
host_mem = builder.build_serialized_network(network, config)
At inference time, after deserializing the engine, users can select what percentage of the overall weights should be kept in device memory at all times. Note that a smaller value will most likely result in performance degradation, so the value should be chosen based on the available device memory:
auto nbBytesWeights = engine->getStreamableWeightsSize();
size_t nbBytesDeviceFree;
size_t nbBytesDeviceTotal;
cudaMemGetInfo(&nbBytesDeviceFree, &nbBytesDeviceTotal);
// For this example, assume the weights to be kept in device memory
// should be the smaller of:
// (1) half the free device memory
// (2) half the total streamable weights
auto weightBudget = std::min(static_cast<int64_t>(nbBytesDeviceFree / 2),
                             nbBytesWeights / 2);
engine->setWeightStreamingBudgetV2(weightBudget);
import pycuda.autoinit  # initializes the CUDA driver context
import pycuda.driver

nb_bytes_weights = engine.streamable_weights_size
free, total = pycuda.driver.mem_get_info()
# For this example, assume the weights to be kept in device memory
# should be the smaller of:
# (1) half the free device memory
# (2) half the total streamable weights
weight_budget = min(free // 2, nb_bytes_weights // 2)
engine.weight_streaming_budget_v2 = weight_budget
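As a usage note (a sketch under the assumption that TensorRT-RTX follows the same semantics as TensorRT here), the streaming budget applies to execution contexts created after it is set, so the typical order is deserialize the engine, set the budget, then create the context:
// Hedged sketch continuing the C++ example above: set the budget before
// creating the execution context, then run inference as usual. Weights that
// do not fit within the budget are streamed from host memory on demand.
engine->setWeightStreamingBudgetV2(weightBudget);
auto* context = engine->createExecutionContext();
// ... set tensor addresses and enqueue inference with `context` ...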
Efficient Engine Compatibility Checking#
Serialized TensorRT-RTX engine files may be incompatible with the current installation of TensorRT-RTX for the following reasons:
The currently installed version of TensorRT-RTX does not match the version with which the engine was built.
The engine file was built to support a different compute capability than that of the currently installed device.
The currently installed CUDA driver or runtime is older than the one present during the engine build.
The serialized model weights require more device memory than is available on the currently installed device, and the engine has not been built to support weight streaming.
The engine file is malformed in some other way, for example, because it was built with TensorRT rather than TensorRT-RTX.
In all these situations, the serialized engine needs to be rebuilt in order to be usable for inference on the current system. This can be done either:
programmatically, using IBuilder::buildSerializedNetwork() (refer to the buildEngine() function in Using the TensorRT-RTX Runtime API for an example), or
on the command line, with tensorrt_rtx --saveEngine (refer to Ahead-of-Time (AOT) Build), as shown below.
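For reference, a command-line rebuild could look like the line below. The --saveEngine flag is the one mentioned above; the --onnx input flag is an assumption based on the tool's trtexec-style interface, so check tensorrt_rtx --help for the exact options available in your installation.
tensorrt_rtx --onnx=model.onnx --saveEngine=model.engine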
To understand whether a rebuild is necessary, one option is to try deserializing the engine with IRuntime::deserializeCudaEngine() and note whether the call fails by returning a null pointer. However, this requires loading the entire serialized engine into a memory buffer, which may require several gigabytes of main memory. This is time and memory consuming, particularly for software products that need to manage many different models for different inference tasks. Techniques like memory-mapped files can alleviate this problem, but are not always a viable option (for example, if the file is located on a network drive).
Instead, call the IRuntime::getEngineValidity() API to only inspect the engine header for metadata indicating whether deserialization is likely to succeed or fail. This only requires loading a small number of bytes from the start of the engine file into main memory; the exact number can be queried with IRuntime::getEngineHeaderSize(). Currently, this is 64 bytes, although the number may be increased in future versions of TensorRT-RTX. If the header indicates that the engine is not compatible with the current system, IRuntime::getEngineValidity() will return EngineValidity::kINVALID, and the reasons for the failure will be flagged via a bitmask output argument. The typical use is as follows:
auto nbBytes = runtime->getEngineHeaderSize();
const char* engineFilePath = "/path/to/engine";
std::vector<char> headerData(nbBytes);
std::ifstream file(engineFilePath, std::ios_base::binary);
if (!file.read(headerData.data(), nbBytes)){
    throw std::runtime_error("File too short");
}
uint64_t diagnostics;
auto validity = runtime->getEngineValidity(headerData.data(), nbBytes, &diagnostics);
if (validity == EngineValidity::kINVALID){
    bool needsWeightStreaming = diagnostics & uint64_t(EngineInvalidityDiagnostics::kINSUFFICIENT_GPU_MEMORY);
    bool incorrectCuda = diagnostics & uint64_t(EngineInvalidityDiagnostics::kCUDA_ERROR);
    if (incorrectCuda){
        // Signal to user to fix their CUDA installation
    } else if (needsWeightStreaming){
        // Rebuild the engine with weight streaming enabled
    } else {
        // Rebuild the engine
    }
}
import tensorrt_rtx as trt

nb_bytes = runtime.engine_header_size
file_path = "/path/to/engine"
with open(file_path, "rb") as f:
    file_bytes = f.read(nb_bytes)
buffer = memoryview(file_bytes)
valid, diagnostics = runtime.get_engine_validity(buffer)
if valid == trt.EngineValidity.INVALID:
    needs_weight_streaming = bool(diagnostics & trt.EngineValidityDiagnostics.INSUFFICIENT_CUDA_MEMORY.value)
    incorrect_cuda = bool(diagnostics & trt.EngineValidityDiagnostics.CUDA_ERROR.value)
    if incorrect_cuda:
        pass  # Signal to user to fix their CUDA installation
    elif needs_weight_streaming:
        pass  # Rebuild the engine with weight streaming enabled
    else:
        pass  # Rebuild the engine
Note that while there may be multiple different error conditions (CUDA runtime too old, CUDA driver too old, mismatching TensorRT-RTX version, and so on), most of them will have the same remedy: rebuilding the TensorRT-RTX AOT engine for the current system. However, we provide the more granular information to help application developers with debugging.
Errors in the CUDA installation (EngineInvalidityDiagnostics::kCUDA_ERROR) need to be treated differently, for example, by restarting the machine or even reinstalling the CUDA runtime. If the detected CUDA driver or CUDA runtime is too old for the engine (EngineInvalidityDiagnostics::kOLD_CUDA_DRIVER or EngineInvalidityDiagnostics::kOLD_CUDA_RUNTIME), application developers may decide whether to rebuild the engine or require desktop users to update their installation. Insufficient CUDA memory will require rebuilding the engine with weight streaming enabled (refer to Weight Streaming). Note that diagnostics should be treated as a bit mask, and that multiple error conditions may be simultaneously true (corresponding to multiple bits being set to 1).
Apart from EngineValidity::kVALID and EngineValidity::kINVALID, the IRuntime::getEngineValidity() API may also return EngineValidity::kSUBOPTIMAL to indicate that the engine can be used for inference, but with suboptimal performance. This may happen when the currently installed GPU belongs to a newer architecture than was available when TensorRT-RTX was built (currently, this means a compute capability greater than 12.0). Unless the engine has been built for a specific (older) compute capability, it can still support inference on these newer architectures via forward-compatible PTX assembly, but performance is likely not on par with native cubin code. Achieving state-of-the-art performance will require updating TensorRT-RTX in addition to rebuilding the engines.
Finally, keep in mind that IRuntime::getEngineValidity() only inspects a short header and therefore cannot guarantee that engine deserialization will succeed even if the return value is EngineValidity::kVALID. (However, a return value of EngineValidity::kINVALID does guarantee that deserialization will fail.) The reason is that later segments of the serialized engine file may have become corrupted. However, if users can verify that the engine file was produced by TensorRT-RTX without corruption (for example, by checking a SHA-256 hash), they can expect a successful validity check to result in successful deserialization.
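To make this concrete, the following C++ sketch continues the earlier example and combines the cheap header check with full deserialization; error handling is abbreviated, and any file-integrity check (such as the SHA-256 comparison mentioned above) is left to the application.
// Hedged sketch: only load and deserialize the full engine file once the
// header check has passed (kVALID or kSUBOPTIMAL).
if (validity != EngineValidity::kINVALID){
    std::ifstream engineFile(engineFilePath, std::ios_base::binary | std::ios_base::ate);
    std::vector<char> engineData(static_cast<size_t>(engineFile.tellg()));
    engineFile.seekg(0);
    engineFile.read(engineData.data(), engineData.size());
    // May still return nullptr if the body of the file is corrupted even though
    // the header looked valid; verifying file integrity (for example, with a
    // SHA-256 checksum) is the application's responsibility.
    auto* engine = runtime->deserializeCudaEngine(engineData.data(), engineData.size());
}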