Advanced#

Working with Quantized Types#

TensorRT-RTX supports the reduced-precision data types INT8, FP4 and FP8 for improved performance at the cost of accuracy. Note that FP8 is only supported for matrix multiplications (GEMM) on Ada and later architectures (compute capability 8.9 or above), while FP4 is only supported for matrix multiplications on Blackwell or later architectures (compute capability 10.0 or above).

Reduced precision (or quantization) must be explicitly selected by the user for each layer that should use it. This is done by inserting IQuantizeLayer and IDequantizeLayer (Q/DQ) nodes into the graph. Quantization can be performed during the training process (Quantization-Aware Training, QAT) or in a separate post-processing step (Post-Training Quantization, PTQ).
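
For illustration, here is a minimal sketch of inserting such Q/DQ nodes through the network-definition API. It assumes that TensorRT-RTX exposes the TensorRT-style addConstant, addQuantize, and addDequantize methods and that the network and the tensor to be quantized already exist; the helper name and the scale value are placeholders, with the scale normally coming from calibration or QAT.

#include <NvInfer.h>

// Hypothetical helper: wrap a tensor in a Q/DQ pair with a per-tensor scale so
// that its consumer (for example, a GEMM) may run in INT8. The float pointed to
// by scaleValue must stay valid until the engine is built.
nvinfer1::ITensor* addQuantizeDequantize(nvinfer1::INetworkDefinition& network,
    nvinfer1::ITensor& input, float const* scaleValue)
{
    nvinfer1::Weights scaleWeights{nvinfer1::DataType::kFLOAT, scaleValue, 1};
    auto* scale = network.addConstant(nvinfer1::Dims{0, {}}, scaleWeights)->getOutput(0);
    auto* q = network.addQuantize(input, *scale, nvinfer1::DataType::kINT8);
    auto* dq = network.addDequantize(*q->getOutput(0), *scale, nvinfer1::DataType::kFLOAT);
    return dq->getOutput(0); // feed this tensor into the layer that should be quantized
}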

Several popular deep learning frameworks allow model quantization using either QAT or PTQ.

Information about which layers to quantize can be encoded in an ONNX model file via the ONNX QuantizeLinear and DequantizeLinear operators, and imported via the TensorRT-RTX ONNX parser.
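
For example, a quantized ONNX model can be imported as in the following sketch, which assumes the TensorRT-style builder and nvonnxparser APIs and an existing ILogger implementation named logger; model.onnx is a placeholder path. The parser maps the ONNX QuantizeLinear/DequantizeLinear operators onto the corresponding IQuantizeLayer and IDequantizeLayer nodes.

#include <NvInfer.h>
#include <NvOnnxParser.h>

#include <memory>

// Build a network definition by parsing a quantized ONNX file.
auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));
if (!parser->parseFromFile("model.onnx", static_cast<int32_t>(nvinfer1::ILogger::Severity::kWARNING)))
{
    // Parsing failed; inspect parser->getError(i) for the individual error messages.
}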

Finally, the NVIDIA TensorRT Model Optimizer (see the NVIDIA/TensorRT-Model-Optimizer repository on GitHub, in particular the examples/windows directory) is an open-source tool developed specifically to help TensorRT and TensorRT-RTX users add quantization to a pretrained model in order to achieve higher performance.

Weight Streaming#

TensorRT-RTX offers a weight streaming feature to allow users to run inference on models that are too large to fit in device memory. Instead, parts of the weights are stored in host memory and streamed to the device as needed. To enable this feature, set kWEIGHT_STREAMING in the builder configuration:

auto config = builder->createBuilderConfig();
config->setFlag(BuilderFlag::kWEIGHT_STREAMING);
auto hostMem = builder->buildSerializedNetwork(*network, *config);
config = builder.create_builder_config()
config.set_flag(tensorrt_rtx.BuilderFlag.WEIGHT_STREAMING)
hostMem = builder.build_serialized_network(network, config)

At inference time, after deserializing the engine, users can set a budget for how many bytes of the weights should be kept in device memory at all times. Note that a smaller budget will most likely result in performance degradation, so the value should be chosen based on the available device memory:

auto nbBytesWeights = engine->getStreamableWeightsSize();
size_t nbBytesDeviceFree;
size_t nbBytesDeviceTotal;
cudaMemGetInfo(&nbBytesDeviceFree, &nbBytesDeviceTotal);
// For this example, assume the weights to be kept in device memory
// should be the smaller of:
// (1) half the free device memory
// (2) half the total streamable weights
auto weightBudget = std::min(static_cast<int64_t>(nbBytesDeviceFree / 2),
    nbBytesWeights / 2);
engine->setWeightStreamingBudgetV2(weightBudget);
import pycuda.driver

nbBytesWeights = engine.streamable_weights_size
free, total = pycuda.driver.mem_get_info()
# For this example, assume the weights to be kept in device memory
# should be the smaller of:
# (1) half the free device memory
# (2) half the total streamable weights
weightBudget = min(free // 2, nbBytesWeights // 2)
engine.weight_streaming_budget_v2 = weightBudget
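
Putting the pieces together, the following sketch shows where the budget call fits in the inference-time setup. It assumes a TensorRT-style runtime API, an existing ILogger implementation named logger, and the serialized engine (hostMem) and budget (weightBudget) from the snippets above.

#include <NvInfer.h>

#include <memory>

// Deserialize the engine that was built with kWEIGHT_STREAMING. The budget
// computation shown above runs at this point; the chosen budget is then applied
// before an execution context is created and inference proceeds as usual.
auto runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(logger));
auto engine = std::unique_ptr<nvinfer1::ICudaEngine>(
    runtime->deserializeCudaEngine(hostMem->data(), hostMem->size()));

engine->setWeightStreamingBudgetV2(weightBudget);
auto context = std::unique_ptr<nvinfer1::IExecutionContext>(engine->createExecutionContext());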

Performance Measurement#

The tensorrt_rtx executable can be used to measure inference performance by timing and averaging inference latencies over repeated iterations. For example, the following PowerShell command measures latencies for a BERT-style transformer model (model.onnx from the google-bert/bert-base-uncased repository on Hugging Face):

tensorrt_rtx.exe --onnx=model.onnx `
    --shapes=input_ids:1x128,attention_mask:1x128,token_type_ids:1x128 `
    --verbose `
    --useCudaGraph `
    --noDataTransfers `
    --useSpinWait `
    --iterations=50 `
    --duration=90 `
    --percentile=5,95 `
    --warmUp=100

We will not explain the full set of configuration flags here; they can be viewed with the command tensorrt_rtx --help. Instead, we focus on a few flags that are particularly useful for performance measurements:

  • --verbose: Captures detailed logging information.

  • --duration=90 / --iterations=50: Iterates over at least 90 seconds, and at least 50 iterations (whichever is longer).

  • --percentile=5,95: Records the 5th percentile and the 95th percentile of the latency distribution (90% of inference runs are expected to fall in this interval).

  • --warmUp=100: Performs inference warm-up runs for 100 ms before starting the measurements.

  • --noDataTransfers: Disables host-to-device and device-to-host data transfers to focus on measuring computation time rather than data transfer time.