Advanced Topics

Weight Streaming

TensorRT-RTX offers a weight streaming feature to allow users to run inference on models that are too large to fit in device memory. Instead, parts of the weights are stored in host memory and streamed to the device as needed. To enable this feature, set kWEIGHT_STREAMING in the builder configuration:

// C++
auto config = builder->createBuilderConfig();
config->setFlag(BuilderFlag::kWEIGHT_STREAMING);
auto hostMem = builder->buildSerializedNetwork(*network, *config);

# Python
config = builder.create_builder_config()
config.set_flag(tensorrt_rtx.BuilderFlag.WEIGHT_STREAMING)
hostMem = builder.build_serialized_network(network, config)

At inference time, after deserializing the engine, users can select how much of the overall weights should be kept in device memory at all times, expressed as a budget in bytes. A smaller budget can result in performance degradation, so choose the value based on the available device memory:

// C++
auto nbBytesWeights = engine->getStreamableWeightsSize();
size_t nbBytesDeviceFree;
size_t nbBytesDeviceTotal;
cudaMemGetInfo(&nbBytesDeviceFree, &nbBytesDeviceTotal);
// For this example, assume the weights to be kept in device memory
// should be the smaller of:
// (1) half the free device memory
// (2) half the total streamable weights
auto weightBudget = std::min(static_cast<int64_t>(nbBytesDeviceFree/2),
        nbBytesWeights/2);
engine->setWeightStreamingBudgetV2(weightBudget);

# Python
nbBytesWeights = engine.streamable_weights_size
free, total = pycuda.driver.mem_get_info()
# For this example, assume the weights to be kept in device memory
# should be the smaller of:
# (1) half the free device memory
# (2) half the total streamable weights
weightBudget = min(free//2, nbBytesWeights//2)
engine.weight_streaming_budget_v2 = weightBudget
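
The budget heuristic above can be factored into a small helper so that it can be tuned and tested without a GPU. This is only a sketch: `choose_weight_budget` and its two fraction parameters are hypothetical names, not part of the TensorRT-RTX API, and the inputs are assumed to come from `engine.streamable_weights_size` and a CUDA free-memory query.

```python
def choose_weight_budget(free_bytes, streamable_bytes,
                         free_fraction=0.5, weight_fraction=0.5):
    """Return a weight streaming budget in bytes (hypothetical helper).

    The budget is the smaller of a fraction of the free device memory and
    a fraction of the streamable weights, and is never negative.
    """
    if not (0.0 <= free_fraction <= 1.0 and 0.0 <= weight_fraction <= 1.0):
        raise ValueError("fractions must be within [0, 1]")
    from_memory = int(free_bytes * free_fraction)
    from_weights = int(streamable_bytes * weight_fraction)
    return max(0, min(from_memory, from_weights))


# Example: 8 GiB free on the device, 6 GiB of streamable weights.
GIB = 1024 ** 3
budget = choose_weight_budget(8 * GIB, 6 * GIB)
print(budget // GIB)  # 3
```

The returned value would then be assigned to `engine.weight_streaming_budget_v2`. Raising either fraction keeps more weights resident on the device, trading memory for speed.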

Efficient Engine Compatibility Checking

Serialized TensorRT-RTX engine files can be incompatible with the current installation of TensorRT-RTX for the following reasons:

  • The installed version of TensorRT-RTX does not match the version with which the engine was built.

  • The engine file was built to support a different compute capability than that of the installed device.

  • The installed CUDA driver or runtime is older than the one present during engine build.

  • The serialized model weights require more device memory than is available on the installed device, and the engine has not been built to support weight streaming.

  • The engine file is malformed in some other way, for example, because it was built with TensorRT rather than TensorRT-RTX.

In all these situations, the serialized engine needs to be rebuilt to be usable for inference on the current system.

To understand whether a rebuild is necessary, one option is to try deserializing the engine with IRuntime::deserializeCudaEngine() and check whether the call fails by returning a null pointer. However, this requires loading the entire serialized engine into a memory buffer, which can require several gigabytes of main memory. This costs both time and memory, particularly for software products that need to manage many different models for different inference tasks. Techniques like memory-mapped files can alleviate this problem, but are not always a viable option (for example, if the file is located on a network drive).

Important

getEngineValidity() inspects only a short header (64 bytes) and does not fully validate the engine. A result of EngineValidity::kVALID means that deserialization is likely, but not guaranteed, to succeed.

Instead of deserializing the whole engine, call IRuntime::getEngineValidity(), which inspects only the engine header for metadata indicating whether deserialization is likely to succeed or fail. This requires loading only a small number of bytes from the start of the engine file into main memory; the exact number can be queried with IRuntime::getEngineHeaderSize(). If the header indicates that the engine is not compatible with the current system, IRuntime::getEngineValidity() returns EngineValidity::kINVALID and flags the reasons for the failure in a bitmask output argument. The typical use is as follows:

// C++
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <vector>

auto nbBytes = runtime->getEngineHeaderSize();
const char* engineFilePath = "/path/to/engine";
std::vector<char> headerData(nbBytes);
std::ifstream file(engineFilePath, std::ios_base::binary);
// Read only the header; this fails if the file is missing or shorter than the header.
if (!file.read(headerData.data(), nbBytes)) {
    throw std::runtime_error("File too short");
}
uint64_t diagnostics = 0;
auto validity = runtime->getEngineValidity(headerData.data(), nbBytes, &diagnostics);
if (validity == EngineValidity::kINVALID) {
    // Each diagnostic is one bit of the mask; several can be set at once.
    bool needsWeightStreaming = (diagnostics & uint64_t(EngineInvalidityDiagnostics::kINSUFFICIENT_GPU_MEMORY)) != 0;
    bool incorrectCuda = (diagnostics & uint64_t(EngineInvalidityDiagnostics::kCUDA_ERROR)) != 0;
    if (incorrectCuda) {
        // Signal to user to fix their CUDA installation
    } else if (needsWeightStreaming) {
        // Rebuild the engine with weight streaming enabled
    } else {
        // Rebuild the engine
    }
}

# Python
import tensorrt_rtx as trt

nb_bytes = runtime.engine_header_size
file_path = "/path/to/engine"
# Read only the header bytes from the start of the engine file.
with open(file_path, "rb") as f:
    file_bytes = f.read(nb_bytes)
if len(file_bytes) < nb_bytes:
    raise ValueError("File too short")
buffer = memoryview(file_bytes)
valid, diagnostics = runtime.get_engine_validity(buffer)
if valid == trt.EngineValidity.INVALID:
    # Each diagnostic is one bit of the mask; several can be set at once.
    needs_weight_streaming = bool(diagnostics & trt.EngineValidityDiagnostics.INSUFFICIENT_CUDA_MEMORY.value)
    incorrect_cuda = bool(diagnostics & trt.EngineValidityDiagnostics.CUDA_ERROR.value)
    if incorrect_cuda:
        pass  # Signal to user to fix their CUDA installation
    elif needs_weight_streaming:
        pass  # Rebuild engine with weight streaming enabled
    else:
        pass  # Rebuild the engine
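
The header read can also be wrapped in a helper that fails clearly when the file is shorter than the header. The following is a minimal sketch using only the standard library; `read_engine_header` is a hypothetical name, and in practice the size would come from `runtime.engine_header_size` (the value 64 below is just demo input against a stand-in file).

```python
import os
import tempfile


def read_engine_header(path, nb_bytes):
    """Read exactly nb_bytes from the start of a file (hypothetical helper)."""
    with open(path, "rb") as f:
        header = f.read(nb_bytes)
    if len(header) < nb_bytes:
        raise ValueError(
            f"{path}: expected {nb_bytes} header bytes, got {len(header)}")
    return header


# Demo with a stand-in file; a real engine file would be used in practice.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 1024)
    demo_path = tmp.name
try:
    header = read_engine_header(demo_path, 64)
    print(len(header))  # 64
finally:
    os.remove(demo_path)
```

Only the requested bytes are read, so the cost stays constant regardless of how large the engine file is.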

Multiple error conditions can occur (CUDA runtime too old, CUDA driver too old, mismatching TensorRT-RTX version, and more), but most share the same remedy: rebuild the TensorRT-RTX AOT engine for the current system. The granular information helps application developers debug.

Errors in the CUDA installation (EngineInvalidityDiagnostics::kCUDA_ERROR) need to be treated differently, for example, by restarting the machine or even reinstalling the CUDA runtime. If the detected CUDA driver or CUDA runtime is too old for the engine (EngineInvalidityDiagnostics::kOLD_CUDA_DRIVER or EngineInvalidityDiagnostics::kOLD_CUDA_RUNTIME), application developers can decide whether to rebuild the engine or require users to update their installation. Insufficient CUDA memory requires rebuilding the engine with weight streaming enabled (refer to Weight Streaming). Treat diagnostics as a bit mask; multiple error conditions can be true simultaneously (corresponding to multiple bits set to 1).
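
Because several bits can be set at once, it can help to collect every applicable remedy rather than stopping at the first match. The sketch below illustrates this with a stand-in `IntFlag`; the `Diag` class, its bit positions, and `choose_remedies` are assumptions made for illustration and do not reflect the actual values of the TensorRT-RTX enumeration.

```python
from enum import IntFlag


class Diag(IntFlag):
    """Stand-in for the diagnostics bit mask; bit positions are placeholders."""
    CUDA_ERROR = 1 << 0
    OLD_CUDA_DRIVER = 1 << 1
    OLD_CUDA_RUNTIME = 1 << 2
    INSUFFICIENT_GPU_MEMORY = 1 << 3


def has(mask, flag):
    """Return True if any bit of flag is set in mask."""
    return (int(mask) & int(flag)) != 0


def choose_remedies(diagnostics):
    """Map a diagnostics bit mask to a list of remedies, most urgent first."""
    remedies = []
    if has(diagnostics, Diag.CUDA_ERROR):
        remedies.append("fix the CUDA installation")
    if has(diagnostics, Diag.OLD_CUDA_DRIVER | Diag.OLD_CUDA_RUNTIME):
        remedies.append("update CUDA or rebuild the engine")
    if has(diagnostics, Diag.INSUFFICIENT_GPU_MEMORY):
        remedies.append("rebuild the engine with weight streaming")
    if not remedies:
        # Any other condition shares the default remedy.
        remedies.append("rebuild the engine")
    return remedies


print(choose_remedies(Diag.OLD_CUDA_DRIVER | Diag.INSUFFICIENT_GPU_MEMORY))
# ['update CUDA or rebuild the engine', 'rebuild the engine with weight streaming']
```

With the real API, the mask would be the diagnostics value returned alongside the validity result, and the flag constants would come from the library's own enumeration.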

Apart from EngineValidity::kVALID and EngineValidity::kINVALID, the IRuntime::getEngineValidity() API can also return EngineValidity::kSUBOPTIMAL to indicate that the engine can be used for inference but with suboptimal performance. This can happen when the installed GPU belongs to a newer architecture than was available when TensorRT-RTX was built (in this release, a compute capability greater than 12.0). Unless the engine has been built for a specific (older) compute capability, it can still support inference on these newer architectures via forward-compatible PTX assembly but the performance is likely not going to be on par with native Cubin code. Achieving state-of-the-art performance will require updating TensorRT-RTX in addition to rebuilding the engines.
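
The three possible results suggest a three-way decision in application code. A minimal sketch follows, assuming a local stand-in enum; `Validity` and `plan_for` are hypothetical and only mirror the three states described above.

```python
from enum import Enum


class Validity(Enum):
    """Stand-in for the three validity results described above."""
    VALID = "valid"
    SUBOPTIMAL = "suboptimal"
    INVALID = "invalid"


def plan_for(validity):
    """Return (can_load, rebuild_recommended) for a validity result."""
    if validity is Validity.VALID:
        return (True, False)
    if validity is Validity.SUBOPTIMAL:
        # Usable now; a rebuild after updating the library may be faster.
        return (True, True)
    return (False, True)


print(plan_for(Validity.SUBOPTIMAL))  # (True, True)
```

An application could load a suboptimal engine immediately and schedule the update and rebuild for later, rather than blocking inference.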

Finally, keep in mind that IRuntime::getEngineValidity() only inspects a short header and therefore cannot guarantee that engine deserialization will succeed even if the return value is EngineValidity::kVALID. (However, returning EngineValidity::kINVALID guarantees that the deserialization will fail.) The reason is that later segments in the serialized engine file may have become corrupted. However, if users can verify that the engine file has been produced by TensorRT-RTX without corruption (for example, by checking a SHA-256 hash), they can expect that a successful validity check will result in successful deserialization.
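
A SHA-256 check of the engine file can be sketched with the Python standard library as follows. The function names are hypothetical, and the expected digest is assumed to have been recorded when the engine was built.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def engine_file_matches(path, expected_hex):
    """Compare the file's SHA-256 digest with a recorded hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```

Hashing still reads the whole file, although only one chunk is held in memory at a time, so it is better suited to a one-time check (for example, after download or installation) than to every compatibility query.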