Working with Runtime Cache#

By default, TensorRT-RTX compiles GPU kernels at runtime. Runtime caching reduces this startup overhead by storing compiled kernels on disk for reuse.

Overview#

When runtime cache is enabled, kernels compiled at runtime can be saved to a local cache file, allowing future runs to load them directly instead of recompiling. This can significantly improve performance in workflows with frequent kernel reuse or repeated application runs, resulting in faster startup times and a smoother user experience. Runtime caching is especially beneficial in production environments or iterative development workflows where minimizing latency is critical.

Compatibility Checks#

A pre-populated runtime cache may have been created in a different or outdated environment. TensorRT-RTX therefore performs the following checks against the runtime environment to ensure the cache is reusable:

  • The runtime environment’s GPU SM version must match that of the cached version.

  • The runtime environment’s TensorRT-RTX version must be greater than or equal to that of the cached version.

  • The runtime environment’s CUDA version must be greater than or equal to that of the cached version.

If any of these checks fail, the pre-populated runtime cache is not used and its contents are replaced after the current execution.

APIs#

Runtime caching is configured when creating a TensorRT-RTX execution context. Ensure the application has completed the necessary steps up to deserializing the TensorRT-RTX engine, so that the deserialized engine is available for creating an execution context.
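
For reference, the prerequisite step might look like the following. This is a minimal sketch, assuming a standard TensorRT-style runtime API (createInferRuntime and deserializeCudaEngine); the header name, the logger object gLogger, and the engine file name are placeholders to adapt to your application.

    #include <fstream>
    #include <iterator>
    #include <vector>
    #include "NvInfer.h"

    using namespace nvinfer1;

    // Read a previously serialized engine from disk (file name is illustrative).
    std::ifstream engineFile("sample.engine", std::ios::binary);
    std::vector<char> engineData((std::istreambuf_iterator<char>(engineFile)),
                                  std::istreambuf_iterator<char>());

    // Deserialize the engine. The runtime cache is configured afterwards,
    // when the execution context is created (see the steps below).
    // gLogger is any ILogger implementation provided by the application.
    IRuntime* runtime = createInferRuntime(gLogger);
    ICudaEngine* engine =
        runtime->deserializeCudaEngine(engineData.data(), engineData.size());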

Creating the Runtime Cache#

During context creation and inference, TensorRT-RTX compiles the kernels needed for inference. Each compiled kernel is added to the runtime cache without duplication.

  1. Create a runtimeConfig object from the engine. Set the appropriate allocation strategy for the execution context.

    IRuntimeConfig* runtimeConfig = engine->createRuntimeConfig();
    runtimeConfig->setExecutionContextAllocationStrategy(ExecutionContextAllocationStrategy::kSTATIC);

    runtime_config = engine.create_runtime_config()
    runtime_config.set_execution_context_allocation_strategy(trt.ExecutionContextAllocationStrategy.STATIC)
    
  2. Create the runtimeCache object and set it to the runtimeConfig object.

    IRuntimeCache* runtimeCache = runtimeConfig->createRuntimeCache();
    runtimeConfig->setRuntimeCache(*runtimeCache);

    runtime_cache = runtime_config.create_runtime_cache()
    runtime_config.set_runtime_cache(runtime_cache)
    
  3. Create the execution context with the configured runtimeConfig object.

    IExecutionContext* context = engine->createExecutionContext(runtimeConfig);

    context = engine.create_execution_context(runtime_config)
    

Load and Save the Runtime Cache#

Your application may need to run inference on the same or similar models repeatedly. In such cases, saving the runtime cache to disk allows previously compiled GPU kernels to be loaded and reused across runs.

TensorRT provides sample utility functions to load the cache file from disk into memory for reuse. The utility function loadTimingCacheFile, originally used for the build-time timing cache, can be shared by runtime caches as well. The loaded bytes can be used for deserialization before the runtime cache is placed into the runtimeConfig object.

  1. Load the runtime cache and run inference.

     std::vector<char> loadedCacheBytes
         = samplesCommon::loadTimingCacheFile(sample::gLogger, ".\\runtime.cache");

     if (!loadedCacheBytes.empty())
     {
         std::vector<uint8_t> runtimeCacheBytes(
                             loadedCacheBytes.begin(), loadedCacheBytes.end());

         runtimeCache->deserialize(
                      runtimeCacheBytes.data(), runtimeCacheBytes.size());

         runtimeConfig->setRuntimeCache(*runtimeCache);
     }
    
     # Use TensorRT’s polygraphy library to load and deserialize cache files
     from polygraphy import util
     from polygraphy.logger import G_LOGGER

     runtime_cache_file = r".\runtime.cache"
     with util.LockFile(runtime_cache_file):
         try:
             loaded_cache_bytes = util.load_file(runtime_cache_file)
             if loaded_cache_bytes:
                 runtime_cache.deserialize(loaded_cache_bytes)
         except Exception:
             G_LOGGER.warning(
                 f"Did not find runtime cache at: {runtime_cache_file}. ")

     runtime_config.set_runtime_cache(runtime_cache)
    
  2. After inference is complete, serialize the runtime cache and save it to disk. Save the runtime cache in binary format (for example, std::ios::binary) rather than as a text file; a minimal std::ofstream sketch is shown after the code below.

     // get the runtime config from the execution context
     IRuntimeConfig* runtimeConfig = context->getRuntimeConfig();

     // get the runtime cache from the runtime config
     IRuntimeCache* runtimeCache = runtimeConfig->getRuntimeCache();

     // serialize the cache into a memory blob
     IHostMemory* hostMemory = runtimeCache->serialize();
     assert(hostMemory != nullptr);

     // save the serialized cache to disk
     samplesCommon::saveTimingCacheFile(
                   sample::gLogger, ".\\runtime.cache", hostMemory);
    
     # get the runtime config from the execution context
     runtime_config = context.get_runtime_config()

     # get the runtime cache from the runtime config
     runtime_cache = runtime_config.get_runtime_cache()

     # serialize the cache into a memory blob and save to disk
     with util.LockFile(runtime_cache_file):
         with runtime_cache.serialize() as buffer:
             util.save_file(buffer, runtime_cache_file,
                            description="runtime cache")
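
     As an alternative to the sample utility, the serialized blob can also be written directly with a std::ofstream opened in binary mode. This is a minimal sketch, assuming the hostMemory object from the C++ snippet above; the file name is illustrative.

      #include <fstream>

      // Write the serialized runtime cache to disk in binary mode
      // (std::ios::binary), as recommended above.
      std::ofstream cacheFile("runtime.cache", std::ios::binary);
      cacheFile.write(static_cast<const char*>(hostMemory->data()),
                      static_cast<std::streamsize>(hostMemory->size()));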
    

tensorrt_rtx Example#

Runtime caching is exposed in tensorrt_rtx through the --runtimeCacheFile flag, which takes a path to the runtime cache file on disk. Ensure that the provided file path has read and write permissions.

# sample command on Windows
tensorrt_rtx --onnx=sample.onnx --runtimeCacheFile=.\runtime.cache
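
On Linux, the same flag applies; only the path separator differs (file names are illustrative):

# sample command on Linux
tensorrt_rtx --onnx=sample.onnx --runtimeCacheFile=./runtime.cache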

The first tensorrt_rtx run fills the cache with the compilation information and serializes it to the specified file. Subsequent tensorrt_rtx runs can reuse the cache file to speed up inference. The acceleration is greatest when the runtime cache is used for the same or similarly structured models.