Weight Streaming#

The weight streaming feature allows you to offload some weights from device memory to host memory. During network execution, these weights are streamed from the host to the device as needed. This technique can free up device memory, enabling you to run larger models or process larger batch sizes.

To enable this feature, create a strongly typed network (kSTRONGLY_TYPED) during engine building and set the kWEIGHT_STREAMING flag on the builder configuration:

C++:

...
builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kSTRONGLY_TYPED));
config->setFlag(BuilderFlag::kWEIGHT_STREAMING);

Python:

builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
config.set_flag(trt.BuilderFlag.WEIGHT_STREAMING)

During runtime, deserialization allocates a host buffer to store all the weights instead of uploading them directly to the device, which can increase the host's peak memory usage. You can use IStreamReaderV2 to deserialize directly from the engine file, avoiding the need for a temporary host buffer and thereby reducing peak memory usage. IStreamReaderV2 replaces the existing IStreamReader deserialization method.
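As a sketch of the pattern, a file-backed reader exposing the read/seek shape that IStreamReaderV2 expects might look like the following. The class and method signatures here are illustrative assumptions, not the real binding; in an actual application you would subclass trt.IStreamReaderV2 and pass the reader to the runtime's deserialize call.

```python
import io

class FileStreamReader:
    """Illustrative file-backed reader with an IStreamReaderV2-style
    read/seek interface (a hypothetical sketch, not the real trt base class)."""

    def __init__(self, path: str):
        self._f = open(path, "rb")

    def read(self, size: int, cuda_stream: int = 0) -> bytes:
        # Hand back up to `size` bytes of the serialized engine.
        # A real implementation could copy directly into device memory
        # on `cuda_stream` instead of returning host bytes.
        return self._f.read(size)

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> bool:
        # Reposition within the engine file (e.g., to skip sections).
        self._f.seek(offset, whence)
        return True

    def close(self):
        self._f.close()
```

Because the reader streams from the file on demand, the full engine never needs to sit in a temporary host buffer during deserialization.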

After deserializing the engine, set the device memory budget for weights by:

C++:

...
engine->setWeightStreamingBudgetV2(size);

Python:

...
engine.weight_streaming_budget_v2 = size

The following APIs can help determine the budget:

  • getStreamableWeightsSize() returns the total size of streamable weights.

  • getWeightStreamingScratchMemorySize() returns the extra scratch memory size for a context when weight streaming is enabled.

  • getDeviceMemorySizeV2() returns the total scratch memory size required by a context. If called before weight streaming is enabled with setWeightStreamingBudgetV2(), the return value excludes the extra scratch memory required by weight streaming, which can be obtained separately from getWeightStreamingScratchMemorySize(); if called afterward, it includes this extra memory.

You can also factor in the current free device memory size, the number of execution contexts, and any other allocation needs when choosing the budget.
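To make the arithmetic concrete, here is a hedged sketch of how these values might be combined into a budget. The helper name and the reservation parameter are invented for illustration; the size arguments would come from the APIs above, and the free device memory from, for example, cudaMemGetInfo.

```python
def choose_weight_budget(free_device_bytes: int,
                         streamable_bytes: int,    # getStreamableWeightsSize()
                         base_scratch_bytes: int,  # getDeviceMemorySizeV2(), pre-streaming
                         ws_scratch_bytes: int,    # getWeightStreamingScratchMemorySize()
                         num_contexts: int,
                         reserve_bytes: int = 0) -> int:
    """Pick a device-memory budget for streamable weights (illustrative)."""
    # Each context needs its base scratch plus the extra weight-streaming scratch.
    scratch_total = num_contexts * (base_scratch_bytes + ws_scratch_bytes)
    available = free_device_bytes - scratch_total - reserve_bytes
    # TensorRT clips budgets above the total streamable size (which disables
    # streaming), so clamp the result to the range [0, streamable_bytes].
    return max(0, min(available, streamable_bytes))
```

The result could then be applied with `engine.weight_streaming_budget_v2 = choose_weight_budget(...)`.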

TensorRT can also automatically determine a memory budget by getWeightStreamingAutomaticBudget(). However, due to limited information about the user’s specific memory allocation requirements, this automatically determined budget can be suboptimal and potentially lead to out-of-memory errors.

If the budget set by setWeightStreamingBudgetV2() is larger than the total size of streamable weights reported by getStreamableWeightsSize(), the budget is clipped to the total size, effectively disabling weight streaming.

You can query the currently set budget with getWeightStreamingBudgetV2().

The budget can be adjusted by setting it again when there is no active context for the engine.

After setting the budget, TensorRT automatically determines which weights to retain in device memory to maximize the overlap between computation and weight fetching.

Cross-Platform Compatibility#

By default, TensorRT engines can only be executed on the same platform (operating system and CPU architecture) where they were built. With build-time configuration, engines can be built to be compatible with other platforms. For example, to build an engine on a Linux x86_64 platform that will run on a Windows x86_64 platform, configure the IBuilderConfig as follows:

config->setRuntimePlatform(nvinfer1::RuntimePlatform::kWINDOWS_AMD64);
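The Python equivalent might look like the following (a sketch assuming the `runtime_platform` property and `RuntimePlatform` enum exposed by the Python builder config):

```python
config.runtime_platform = trt.RuntimePlatform.WINDOWS_AMD64
```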

The cross-platform engine might have performance differences from the natively built engine on the target platform. It also cannot run on the host platform it was built on.

When building a cross-platform engine that also requires version forward compatibility, kEXCLUDE_LEAN_RUNTIME must be set to exclude the target platform lean runtime.
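A version-compatible cross-platform build might therefore combine the flags as follows (a configuration sketch, assuming the Python names of the kVERSION_COMPATIBLE and kEXCLUDE_LEAN_RUNTIME flags):

```python
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)
config.set_flag(trt.BuilderFlag.EXCLUDE_LEAN_RUNTIME)
config.runtime_platform = trt.RuntimePlatform.WINDOWS_AMD64
```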

Cross-platform building requires installing a separate Windows builder resource library. Refer to the TensorRT Installation Guide for more information.