# Working with RTX CUDA Graphs
TensorRT for RTX provides out-of-the-box CUDA Graph support, enabling CUDA Graph optimizations with a single-line code change and natively supporting intelligent graph capture for dynamic shapes.
## Introduction
The NVIDIA CUDA Toolkit includes support for CUDA Graphs, which enables the capture or construction of execution graphs that optimize the GPU operation launch workflow. CUDA Graphs help reduce runtime overhead by minimizing CPU-to-GPU kernel launch overhead and lowering GPU grid initialization costs.
Currently, you can implement CUDA Graphs in your TensorRT inference workflows by capturing the stream during inference. However, with TensorRT for RTX using dynamic shapes, TensorRT compiles and runs two types of dynamic shape kernels: fallback kernels and shape-specialized kernels. As a result, it can be difficult to determine the optimal points to begin and end CUDA Graph capture.
To address this, TensorRT-RTX now provides a native API that requires only a single line of code change to enable intelligent CUDA Graph capture for dynamic shapes. This API allows more precise and effective graph management, even in the presence of runtime shape variation.
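To see why runtime shape variation complicates graph capture, consider a toy model of a graph cache keyed by input shape. This is purely illustrative Python, not the TensorRT-RTX implementation: a captured graph replays a fixed sequence of launches, so it can only be reused when the input shape matches, and unseen shapes must pay the capture (or fallback) cost.

```python
class ToyGraphCache:
    """Illustrative model of shape-keyed CUDA Graph reuse (not real TensorRT-RTX code)."""

    def __init__(self):
        self.graphs = {}   # input shape -> "captured graph"
        self.captures = 0  # how many times we paid the capture cost
        self.replays = 0   # how many launches were replayed cheaply

    def run(self, shape):
        if shape not in self.graphs:
            # First time this shape is seen: run eagerly and capture.
            self.captures += 1
            self.graphs[shape] = f"graph-for-{shape}"
        else:
            # Shape seen before: replay the captured graph, skipping
            # per-kernel CPU launch overhead.
            self.replays += 1
        return self.graphs[shape]


cache = ToyGraphCache()
for shape in [(1, 224, 224), (1, 224, 224), (4, 224, 224), (1, 224, 224)]:
    cache.run(shape)

print(cache.captures, cache.replays)  # → 2 2
```

The native API below handles this bookkeeping internally, so the application does not need to decide where capture should begin and end.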
We expect the most performance improvements from this API in the following scenarios:

- Inference with dynamic input shapes
- Windows operating systems with hardware-accelerated GPU scheduling
## APIs
This RTX CUDA Graphs feature is introduced as part of the runtime config (`IRuntimeConfig`), which is in turn used for execution context creation. Ensure the application has completed the necessary steps up to deserializing the TensorRT-RTX engine; the deserialized engine is then used to create an execution context.
This RTX CUDA Graphs feature is disabled by default. The following flow demonstrates how to enable it:
Create a `runtimeConfig` object from the engine.

```cpp
IRuntimeConfig* runtimeConfig = engine->createRuntimeConfig();
```

```python
runtime_config = engine.create_runtime_config()
```
Set the CUDA Graph strategy to whole-graph capture and confirm the setting.

```cpp
bool success = runtimeConfig->setCudaGraphStrategy(nvinfer1::CudaGraphStrategy::kWHOLE_GRAPH_CAPTURE);

assert(success);
```

```python
runtime_config.cuda_graph_strategy = trt.CudaGraphStrategy.WHOLE_GRAPH_CAPTURE
```
Create the execution context with the configured `runtimeConfig` object.

```cpp
IExecutionContext* context = engine->createExecutionContext(runtimeConfig);
```

```python
context = engine.create_execution_context(runtime_config)
```
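The steps above can be combined into a single helper. The following is a minimal sketch, assuming the TensorRT for RTX Python bindings are importable as `tensorrt_rtx` and that `engine_bytes` holds a serialized engine; error handling is omitted for brevity.

```python
def create_context_with_cuda_graphs(engine_bytes):
    """Sketch: build an execution context with whole-graph CUDA Graph capture.

    Assumes the TensorRT for RTX Python bindings are installed; the module
    name `tensorrt_rtx` is an assumption here.
    """
    import tensorrt_rtx as trt  # assumed module name for the RTX bindings

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(engine_bytes)

    # Step 1: create a runtime config from the engine.
    runtime_config = engine.create_runtime_config()

    # Step 2: opt in to whole-graph CUDA Graph capture.
    runtime_config.cuda_graph_strategy = trt.CudaGraphStrategy.WHOLE_GRAPH_CAPTURE

    # Step 3: create the execution context with the configured runtime config.
    return engine.create_execution_context(runtime_config)
```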
## Sample Implementations

### Example: tensorrt_rtx
RTX CUDA Graph support is available in `tensorrt_rtx` via the `--rtxCudaGraphStrategy` flag. This flag controls how CUDA Graphs are enabled and used during inference.
The following options are available:

- `disable` (default) – Disables RTX CUDA Graph functionality.
- `wholeGraph` – Enables and captures the entire inference graph using CUDA Graphs.
```shell
# sample command on Windows
tensorrt_rtx.exe --onnx=.\sample.onnx --rtxCudaGraphStrategy=wholeGraph

# sample command on Linux
tensorrt_rtx --onnx=sample.onnx --rtxCudaGraphStrategy=wholeGraph
```
### Example: Open Source Software
Refer to the RTX CUDA Graph sample implementations in our OSS repo.
## Limitations
Even when enabled, the RTX CUDA Graph feature will not be utilized in the following scenarios:
- The provided stream does not allow graph capture.
- The stream is already being captured elsewhere.
- The GPU memory allocation strategy is blocking.
- The engine contains layers with data-dependent dynamic shapes or on-device control flow (for example, if-else nodes).
- The engine is streaming weights during execution.
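As a conceptual summary of these fallback conditions, an application could reason about whether capture will actually be used as follows. This is a hypothetical helper that simply restates the documented conditions; it is not part of the TensorRT-RTX API.

```python
def cuda_graphs_will_be_used(
    stream_capturable,         # the provided stream allows graph capture
    stream_already_capturing,  # the stream is being captured elsewhere
    blocking_memory_alloc,     # GPU memory allocation strategy is blocking
    data_dependent_shapes,     # data-dependent shapes or on-device control flow
    weight_streaming,          # engine streams weights during execution
):
    """Hypothetical checker mirroring the documented fallback conditions."""
    return (
        stream_capturable
        and not stream_already_capturing
        and not blocking_memory_alloc
        and not data_dependent_shapes
        and not weight_streaming
    )


# A capturable stream with none of the listed blockers can use CUDA Graphs.
print(cuda_graphs_will_be_used(True, False, False, False, False))  # → True
# Any single blocker (here, data-dependent shapes) disables the feature.
print(cuda_graphs_will_be_used(True, False, False, True, False))   # → False
```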
For applications involving dynamic shape inference, we strongly recommend using RTX CUDA Graphs via the native API, rather than implementing custom graph capture logic.