Advanced#

Working with Quantized Types#

TensorRT-RTX supports the reduced-precision data types INT8, FP4 and FP8 for improved performance at the cost of accuracy. Note that FP8 is only supported for matrix multiplications (GEMM) on Ada and later architectures (compute capability 8.9 or above), while FP4 is only supported for matrix multiplications on Blackwell or later architectures (compute capability 10.0 or above).

Reduced precision (or quantization) must be explicitly selected by the user for each layer that should use it. This is done by inserting IQuantizeLayer and IDequantizeLayer (Q/DQ) nodes into the graph. Quantization can be performed during the training process (Quantization-Aware Training, QAT) or in a separate post-processing step (Post-Training Quantization, PTQ).
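
For illustration, here is a minimal sketch of inserting such Q/DQ nodes through the network-definition API. It assumes that TensorRT-RTX exposes the TensorRT-style addConstant, addQuantize, and addDequantize methods and that the network and the tensor to be quantized already exist; the helper name and the scale value are placeholders, with the scale normally coming from calibration or QAT.

#include <NvInfer.h>

// Hypothetical helper: wrap a tensor in a Q/DQ pair with a per-tensor scale so
// that its consumer (for example, a GEMM) may run in INT8. The float pointed to
// by scaleValue must stay valid until the engine is built.
nvinfer1::ITensor* addQuantizeDequantize(nvinfer1::INetworkDefinition& network,
    nvinfer1::ITensor& input, float const* scaleValue)
{
    nvinfer1::Weights scaleWeights{nvinfer1::DataType::kFLOAT, scaleValue, 1};
    auto* scale = network.addConstant(nvinfer1::Dims{0, {}}, scaleWeights)->getOutput(0);
    auto* q = network.addQuantize(input, *scale, nvinfer1::DataType::kINT8);
    auto* dq = network.addDequantize(*q->getOutput(0), *scale, nvinfer1::DataType::kFLOAT);
    return dq->getOutput(0); // feed this tensor into the layer that should be quantized
}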

Several popular deep learning frameworks allow model quantization using either QAT or PTQ.

Information about which layers to quantize can be encoded in an ONNX model file via the ONNX QuantizeLinear and DequantizeLinear operators, and imported via the TensorRT-RTX ONNX parser.
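
For example, a quantized ONNX model can be imported as in the following sketch, which assumes the TensorRT-style builder and nvonnxparser APIs and an existing ILogger implementation named logger; model.onnx is a placeholder path. The parser maps the ONNX QuantizeLinear/DequantizeLinear operators onto the corresponding IQuantizeLayer and IDequantizeLayer nodes.

#include <NvInfer.h>
#include <NvOnnxParser.h>

#include <memory>

// Build a network definition by parsing a quantized ONNX file.
auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));
if (!parser->parseFromFile("model.onnx", static_cast<int32_t>(nvinfer1::ILogger::Severity::kWARNING)))
{
    // Parsing failed; inspect parser->getError(i) for the individual error messages.
}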

Finally, the NVIDIA TensorRT Model Optimizer (see the NVIDIA/TensorRT-Model-Optimizer repository on GitHub, in particular the examples/windows directory) is an open-source tool developed specifically to help TensorRT and TensorRT-RTX users add quantization to a pretrained model in order to achieve higher performance.

Weight Streaming#

TensorRT-RTX offers a weight streaming feature to allow users to run inference on models that are too large to fit in device memory. Instead, parts of the weights are stored in host memory and streamed to the device as needed. To enable this feature, set kWEIGHT_STREAMING in the builder configuration:

auto config = builder->createBuilderConfig();
config->setFlag(BuilderFlag::kWEIGHT_STREAMING);
auto hostMem = builder->buildSerializedNetwork(*network, *config);
config = builder.create_builder_config()
config.set_flag(tensorrt_rtx.BuilderFlag.WEIGHT_STREAMING)
hostMem = builder.build_serialized_network(network, config)

At inference time, after deserializing the engine, users can set a budget for how many bytes of the weights should be kept in device memory at all times. Note that a smaller budget will most likely result in performance degradation, so the value should be chosen based on the available device memory:

auto nbBytesWeights = engine->getStreamableWeightsSize();
size_t nbBytesDeviceFree;
size_t nbBytesDeviceTotal;
cudaMemGetInfo(&nbBytesDeviceFree, &nbBytesDeviceTotal);
// For this example, assume the weights to be kept in device memory
// should be the smaller of:
// (1) half the free device memory
// (2) half the total streamable weights
auto weightBudget = std::min(static_cast<int64_t>(nbBytesDeviceFree / 2),
    nbBytesWeights / 2);
engine->setWeightStreamingBudgetV2(weightBudget);
import pycuda.driver

nbBytesWeights = engine.streamable_weights_size
free, total = pycuda.driver.mem_get_info()
# For this example, assume the weights to be kept in device memory
# should be the smaller of:
# (1) half the free device memory
# (2) half the total streamable weights
weightBudget = min(free // 2, nbBytesWeights // 2)
engine.weight_streaming_budget_v2 = weightBudget
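
Putting the pieces together, the following sketch shows where the budget call fits in the inference-time setup. It assumes a TensorRT-style runtime API, an existing ILogger implementation named logger, and the serialized engine (hostMem) and budget (weightBudget) from the snippets above.

#include <NvInfer.h>

#include <memory>

// Deserialize the engine that was built with kWEIGHT_STREAMING. The budget
// computation shown above runs at this point; the chosen budget is then applied
// before an execution context is created and inference proceeds as usual.
auto runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(logger));
auto engine = std::unique_ptr<nvinfer1::ICudaEngine>(
    runtime->deserializeCudaEngine(hostMem->data(), hostMem->size()));

engine->setWeightStreamingBudgetV2(weightBudget);
auto context = std::unique_ptr<nvinfer1::IExecutionContext>(engine->createExecutionContext());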

Performance Measurement#

The tensorrt_rtx executable can be used to measure inference performance by timing and averaging inference latencies over repeated iterations. For example, the following PowerShell command measures latencies for a BERT-style transformer model (model.onnx from the google-bert/bert-base-uncased repository on Hugging Face):

tensorrt_rtx.exe --onnx=model.onnx `
    --shapes=input_ids:1x128,attention_mask:1x128,token_type_ids:1x128 `
    --verbose `
    --useCudaGraph `
    --noDataTransfers `
    --useSpinWait `
    --iterations=50 `
    --duration=90 `
    --percentile=5,95 `
    --warmUp=100

We will not explain the full set of configuration flags here; they can be viewed with the command tensorrt_rtx --help. Instead, we focus on a few flags that are particularly useful for performance measurements:

  • --verbose: Captures detailed logging information.

  • --duration=90 / --iterations=50: Iterates over at least 90 seconds, and at least 50 iterations (whichever is longer).

  • --percentile=5,95: Records the 5th percentile and the 95th percentile of the latency distribution (90% of inference runs are expected to fall in this interval).

  • --warmUp=100: Performs inference warm-up runs for 100 ms before starting the measurements.

  • --noDataTransfers: Disables host-to-device and device-to-host data transfers to focus on measuring computation time rather than data transfer time.