Weight Streaming#

The weight streaming feature allows you to offload some weights from device memory to host memory. During network execution, these weights are streamed from the host to the device as needed. This technique can free up device memory, enabling you to run larger models or process larger batch sizes.

To enable this feature, create a strongly typed network (kSTRONGLY_TYPED) during engine building and set the kWEIGHT_STREAMING flag on the builder configuration:

C++:

...
builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kSTRONGLY_TYPED));
config->setFlag(BuilderFlag::kWEIGHT_STREAMING);

Python:

builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
config.set_flag(trt.BuilderFlag.WEIGHT_STREAMING)

During runtime, deserialization allocates a host buffer to store all the weights instead of uploading them directly to the device, which can increase the host's peak memory usage. You can use IStreamReaderV2 to deserialize directly from the engine file, avoiding the need for a temporary host buffer and thereby reducing peak memory usage. IStreamReaderV2 replaces the existing IStreamReader deserialization method.
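As a sketch of the pattern, a file-backed reader exposing the read/seek shape that IStreamReaderV2 expects might look like the following. The class and method signatures here are illustrative assumptions, not the real binding; in an actual application you would subclass trt.IStreamReaderV2 and pass the reader to the runtime's deserialize call.

```python
import io

class FileStreamReader:
    """Illustrative file-backed reader with an IStreamReaderV2-style
    read/seek interface (a hypothetical sketch, not the real trt base class)."""

    def __init__(self, path: str):
        self._f = open(path, "rb")

    def read(self, size: int, cuda_stream: int = 0) -> bytes:
        # Hand back up to `size` bytes of the serialized engine.
        # A real implementation could copy directly into device memory
        # on `cuda_stream` instead of returning host bytes.
        return self._f.read(size)

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> bool:
        # Reposition within the engine file (e.g., to skip sections).
        self._f.seek(offset, whence)
        return True

    def close(self):
        self._f.close()
```

Because the reader streams from the file on demand, the full engine never needs to sit in a temporary host buffer during deserialization.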

After deserializing the engine, set the device memory budget for weights by:

C++:

...
engine->setWeightStreamingBudgetV2(size);

Python:

...
engine.weight_streaming_budget_v2 = size

The following APIs can help determine the budget:

  • getStreamableWeightsSize() returns the total size of streamable weights.

  • getWeightStreamingScratchMemorySize() returns the extra scratch memory size for a context when weight streaming is enabled.

  • getDeviceMemorySizeV2() returns the total scratch memory size required by a context. If called before weight streaming is enabled with setWeightStreamingBudgetV2(), the return value excludes the extra scratch memory required by weight streaming, which can be obtained separately from getWeightStreamingScratchMemorySize(); if called afterward, it includes this extra memory.

You can also factor in the current free device memory size, the number of execution contexts, and any other allocation needs when choosing the budget.
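To make the arithmetic concrete, here is a hedged sketch of how these values might be combined into a budget. The helper name and the reservation parameter are invented for illustration; the size arguments would come from the APIs above, and the free device memory from, for example, cudaMemGetInfo.

```python
def choose_weight_budget(free_device_bytes: int,
                         streamable_bytes: int,    # getStreamableWeightsSize()
                         base_scratch_bytes: int,  # getDeviceMemorySizeV2(), pre-streaming
                         ws_scratch_bytes: int,    # getWeightStreamingScratchMemorySize()
                         num_contexts: int,
                         reserve_bytes: int = 0) -> int:
    """Pick a device-memory budget for streamable weights (illustrative)."""
    # Each context needs its base scratch plus the extra weight-streaming scratch.
    scratch_total = num_contexts * (base_scratch_bytes + ws_scratch_bytes)
    available = free_device_bytes - scratch_total - reserve_bytes
    # TensorRT clips budgets above the total streamable size (which disables
    # streaming), so clamp the result to the range [0, streamable_bytes].
    return max(0, min(available, streamable_bytes))
```

The result could then be applied with `engine.weight_streaming_budget_v2 = choose_weight_budget(...)`.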

TensorRT can also automatically determine a memory budget by getWeightStreamingAutomaticBudget(). However, due to limited information about the user’s specific memory allocation requirements, this automatically determined budget can be suboptimal and potentially lead to out-of-memory errors.

If the budget set by setWeightStreamingBudgetV2() is larger than the total size of streamable weights reported by getStreamableWeightsSize(), the budget is clipped to the total size, effectively disabling weight streaming.

You can query the currently set budget with getWeightStreamingBudgetV2().

The budget can be adjusted by setting it again when there is no active context for the engine.

After setting the budget, TensorRT automatically determines which weights to retain in device memory to maximize the overlap between computation and weight fetching.

Cross-Platform Compatibility#

By default, TensorRT engines can only be executed on the same platform (operating system and CPU architecture) where they were built. With build-time configuration, engines can be built to be compatible with other platforms. For example, to build an engine on a Linux x86_64 platform that will run on a Windows x86_64 platform, configure the IBuilderConfig as follows:

config->setRuntimePlatform(nvinfer1::RuntimePlatform::kWINDOWS_AMD64);
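The Python equivalent might look like the following (a sketch assuming the `runtime_platform` property and `RuntimePlatform` enum exposed by the Python builder config):

```python
config.runtime_platform = trt.RuntimePlatform.WINDOWS_AMD64
```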

The cross-platform engine might have performance differences from the natively built engine on the target platform. It also cannot run on the host platform it was built on.

When building a cross-platform engine that also requires version forward compatibility, kEXCLUDE_LEAN_RUNTIME must be set to exclude the target platform lean runtime.
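A version-compatible cross-platform build might therefore combine the flags as follows (a configuration sketch, assuming the Python names of the kVERSION_COMPATIBLE and kEXCLUDE_LEAN_RUNTIME flags):

```python
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)
config.set_flag(trt.BuilderFlag.EXCLUDE_LEAN_RUNTIME)
config.runtime_platform = trt.RuntimePlatform.WINDOWS_AMD64
```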

Cross-platform building requires installing a separate Windows builder resource library. Refer to the TensorRT Installation Guide for more information.