Holoscan SDK’s GPU-resident graphs enable deterministic, real-time, low-latency execution of Holoscan applications by keeping the (CUDA) compute pipeline on the GPU for the lifetime of an application. In GPU-resident graph execution mode, an application runs entirely on the GPU, with little or no involvement from the CPU. Unlike traditional CPU-driven scheduling and execution, GPU-resident graphs leverage CUDA Graphs to capture and replay an entire Holoscan application directly on the GPU. This removes CPU-side scheduling, synchronization, and orchestration overhead, along with costly CPU-GPU coordination, and makes GPU applications more predictable.
For sensor inputs and actuation outputs, Holoscan SDK GPU-resident graphs are combined with devices and mechanisms that use GPU-direct RDMA technologies, such as Holoscan Sensor Bridge and DOCA GPUNetIO. Holoscan SDK GPU-resident graphs also support low-latency, highly responsive, and predictable visualization outputs, especially on monitors that support NVIDIA G-SYNC.
GPU-resident graphs do not follow the traditional Holoscan SDK execution workflow with the GXF backend and cannot be interconnected with GXF-backend fragments or operators. GPU-resident execution is a standalone execution model, supported by a separate Holoscan SDK executor backend.
GPU-resident graphs are only supported in C++. Python support is planned for the future.
GPU-resident graphs are ideal for:
GPU-resident operators inherit from holoscan::GPUResidentOperator instead of the standard holoscan::Operator class. These operators:
The SDK also supports concrete holoscan::ops classes that inherit from holoscan::GPUResidentOperator. They follow the same static buffer and capture rules as in the API reference section below. These are distinct from the CPU-driven operators listed under Holoscan Operators (for example holoscan::ops::FormatConverterOp): GPU-resident variants fix sizes and formats at initialization and use device ports only.
Additional GPU-resident operators are available in HoloHub:
A GPU-resident Fragment is created by composing GPU-resident operators. The framework automatically detects that a holoscan::Fragment should use GPU-resident graphs when all operators in the fragment inherit from holoscan::GPUResidentOperator. GPU-resident execution supports acyclic operator graphs with a single source operator. During initialization, the framework flattens the graph in topological order before connecting device memory and capturing compute() calls into CUDA Graphs.
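The topology rule described above (an acyclic graph with exactly one source, flattened in topological order) can be illustrated with a minimal sketch. This is not the SDK's implementation: `flatten_single_source_dag` is a hypothetical helper, operators are represented by plain integer indices, and the ordering uses Kahn's algorithm as one common way to produce a topological order.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the Holoscan SDK API.
// Operators are identified by index; each edge {from, to} means "from feeds to".
// Returns the operators in topological order, or std::nullopt if the graph has
// a cycle or does not have exactly one source (an operator with no inputs).
std::optional<std::vector<int>> flatten_single_source_dag(
    int num_ops, const std::vector<std::pair<int, int>>& edges) {
  assert(num_ops > 0);
  std::vector<std::vector<int>> successors(num_ops);
  std::vector<int> in_degree(num_ops, 0);
  for (const auto& [from, to] : edges) {
    assert(from >= 0 && from < num_ops && to >= 0 && to < num_ops);
    successors[from].push_back(to);
    ++in_degree[to];
  }

  // Collect the sources: operators that no other operator feeds.
  std::queue<int> ready;
  for (int op = 0; op < num_ops; ++op) {
    if (in_degree[op] == 0) ready.push(op);
  }
  if (ready.size() != 1) return std::nullopt;  // zero or multiple sources

  // Kahn's algorithm: emit an operator once all of its inputs were emitted.
  std::vector<int> order;
  while (!ready.empty()) {
    const int op = ready.front();
    ready.pop();
    order.push_back(op);
    for (int next : successors[op]) {
      if (--in_degree[next] == 0) ready.push(next);
    }
  }
  // Operators left unvisited can only be part of a cycle.
  if (order.size() != static_cast<std::size_t>(num_ops)) return std::nullopt;
  return order;
}
```

In this sketch, a diamond-shaped graph (one source fanning out to two operators that join in a sink) is accepted, while a graph with two sources or a cycle is rejected.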
To create a GPU-resident Fragment:
- Define operators that inherit from holoscan::GPUResidentOperator
- Compose the operators with make_operator<>() and add_flow()
Access GPU-resident controls via fragment->gpu_resident() after calling run_async(). See the API reference section for details.
There can only be one GPU-resident fragment per Holoscan SDK application. Other traditional Holoscan SDK fragments can co-exist with a GPU-resident fragment in the same application. However, connections between GPU-resident and traditional Holoscan SDK fragments and operators (with GXF backend) are not supported.
The GPU-resident fragment is designated as the main workload processing fragment. Main data processing, such as image processing and AI model inference, happens in this fragment. An optional data ready handler fragment can augment the main fragment to handle external sensor data inputs automatically.
A data ready handler is an optional feature that allows a GPU-resident fragment to handle external data sources on the GPU itself. With this feature, a CUDA kernel used as a data ready handler can check whether new data is available for processing, and the GPU-resident data processing pipeline is then triggered.
A data ready handler is created with a separate GPU-resident fragment that is registered with the main GPU-resident fragment. This handler runs at the beginning of each iteration. It can determine if data is ready for processing. It is important to note that the data ready handler fragment is not a separately executable GPU-resident fragment in a Holoscan SDK application. This fragment is only used as a part of the main workload processing GPU-resident fragment. Holoscan SDK currently only allows one main workload processing GPU-resident fragment per application.
The data ready handler allows for:
This feature allows integration of GPU-direct technologies into the Holoscan SDK GPU-resident graph execution mode. With it, developers can build sensor data-driven GPU-resident pipelines. As sensor data arrives on the GPU, without any involvement from the CPU, the data ready handler detects it and triggers the main GPU-resident (data processing) workload.
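The control flow described above can be sketched as a host-side simulation. This is only a conceptual model: `simulate_resident_loop` and `LoopStats` are hypothetical names, the boolean flag stands in for the data ready signal in device memory, and the sketch assumes the flag is cleared after each processed iteration so that the loop waits for the next arrival.

```cpp
#include <cassert>
#include <vector>

struct LoopStats {
  int workload_runs = 0;  // iterations in which the main workload ran
  int idle_sleeps = 0;    // iterations spent sleeping because no data was ready
};

// Hypothetical host-side model of the GPU-resident loop. Each element of
// sensor_arrivals is one loop iteration and tells whether new sensor data
// landed in (simulated) device memory before that iteration.
LoopStats simulate_resident_loop(const std::vector<bool>& sensor_arrivals) {
  LoopStats stats;
  bool data_ready = false;  // stand-in for the data ready signal on the device
  for (bool arrived : sensor_arrivals) {
    // The data ready handler runs first in every iteration.
    if (arrived) data_ready = true;

    if (data_ready) {
      ++stats.workload_runs;  // the main workload fragment processes the data
      data_ready = false;     // assumed: cleared so the loop waits for new data
    } else {
      ++stats.idle_sleeps;    // stands in for the configured sleep interval
    }
  }
  return stats;
}

// Usage: data arrives before the second and third of four iterations.
inline void simulate_resident_loop_example() {
  const LoopStats stats = simulate_resident_loop({false, true, true, false});
  assert(stats.workload_runs == 2 && stats.idle_sleeps == 2);
}
```

In a real pipeline both the handler and the workload are CUDA kernels captured into the graph, so none of this logic runs on the CPU.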
When a GPU-resident fragment is initialized, the framework:
- Captures each operator's compute() method into a CUDA graph
The GPU-resident CUDA graph is launched once from the host CPU process. The graph keeps running on the GPU until the application terminates. For each new (sensor) data input, the GPU-resident CUDA graph processes the data without ever leaving the GPU. In the absence of any intervention from the host CPU process, the GPU-resident graphs maintain deterministic execution timing for sensor data processing.
- Call run_async() to start the GPU-resident graphs
Optionally, the host CPU can also control the GPU-resident graphs with the following steps:
- Call data_ready() to trigger processing
- Poll result_ready() to know when results are available
- The graph waits until the data_ready() signal is set again
This is useful for debugging, development, testing, and cases where GPU-direct technologies are not available or not yet integrated.
When using host CPU-driven graphs with functions like cudaMemcpy to read back results between iterations, enable sync_with_host() before launching the graph to guarantee that all device memory writes are visible to the host before result_ready() returns true. See sync_with_host for details.
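The host-driven sequence described above can be sketched with a mock so that it runs without a GPU or the SDK. `MockGpuResidentControl` is a hypothetical stand-in for the object returned by fragment->gpu_resident(): the method names follow this page, but `launch()` (standing in for run_async()), the `input`/`output` members (standing in for device memory), and the doubling workload are all assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the GPU-resident controls. The bodies only
// simulate the GPU on the host; they are not the SDK's implementation.
class MockGpuResidentControl {
 public:
  int input = 0;   // stand-in for the source operator's input device memory
  int output = 0;  // stand-in for the sink operator's output device memory

  void sync_with_host(bool enable) {
    assert(!launched_);  // must be configured before the graph is launched
    sync_with_host_ = enable;
  }
  bool sync_with_host_enabled() const { return sync_with_host_; }
  void launch() { launched_ = true; }  // stands in for run_async()
  bool is_launched() const { return launched_; }
  void tear_down() { launched_ = false; }

  void data_ready() {
    assert(launched_);
    data_ready_ = true;
    result_ready_ = false;
  }
  bool result_ready() {
    if (data_ready_) {     // simulate one pipeline iteration
      output = 2 * input;  // placeholder workload
      data_ready_ = false;
      result_ready_ = true;
    }
    return result_ready_;
  }

 private:
  bool launched_ = false;
  bool sync_with_host_ = false;
  bool data_ready_ = false;
  bool result_ready_ = false;
};

// Drives the mock from the host: configure, launch, then one
// write / signal / poll / read-back cycle per input value.
std::vector<int> drive_from_host(MockGpuResidentControl& ctrl,
                                 const std::vector<int>& inputs) {
  ctrl.sync_with_host(true);  // before launch, because results are read back
  ctrl.launch();
  while (!ctrl.is_launched()) {}  // wait for initialization to complete

  std::vector<int> results;
  for (int value : inputs) {
    ctrl.input = value;              // write input (a host-to-device copy in a real app)
    ctrl.data_ready();               // signal that input is ready
    while (!ctrl.result_ready()) {}  // poll until this iteration has finished
    results.push_back(ctrl.output);  // read back (a device-to-host copy in a real app)
  }

  ctrl.tear_down();
  while (ctrl.is_launched()) {}  // tear-down may take time; poll until done
  return results;
}
```

The ordering is the point of the sketch: settings are applied before launch, data_ready() is called only after the input has been written, and results are read only after result_ready() returns true.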
The base class for GPU-resident operators is holoscan::GPUResidentOperator. Inherit from this class to create operators that execute in GPU-resident mode.
Ports can be declared in two ways: by memory block size (the executor allocates shared device memory) or by device pointer (the operator provides its own pre-allocated device memory).
Memory block size (executor-allocated)
Use device_input() and device_output() with a size_t or integer literal to declare ports with executor-allocated device memory. The executor will allocate a shared device buffer for each connection. Connected ports map to the same device memory address. Operators may declare multiple input and/or output ports; each port is connected independently via the port map in add_flow().
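The idea that connected ports map to one shared buffer can be sketched as follows. `MockExecutor`, its `connect()` and `memory()` methods, and the "operator.port" naming are hypothetical, and host memory replaces device memory so the sketch runs without a GPU; it is not how the SDK executor is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the executor's bookkeeping of port memory.
class MockExecutor {
 public:
  // Connects one output port to one input port and backs both with one buffer.
  void connect(const std::string& output_port, const std::string& input_port,
               std::size_t block_size) {
    assert(block_size > 0);
    buffers_.push_back(std::make_unique<std::byte[]>(block_size));
    void* address = buffers_.back().get();
    port_address_[output_port] = address;
    port_address_[input_port] = address;
  }

  // Similar in spirit to device_memory(port_name): look up a port's buffer.
  // Returns nullptr for a port that was never connected.
  void* memory(const std::string& port) const {
    const auto it = port_address_.find(port);
    return it == port_address_.end() ? nullptr : it->second;
  }

 private:
  std::vector<std::unique_ptr<std::byte[]>> buffers_;  // one per connection
  std::map<std::string, void*> port_address_;
};
```

With two connections, "source.out" and "filter.in" resolve to the same address, while "filter.out" and "sink.in" share a different buffer, which mirrors each port pair being connected independently.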
Device pointer (operator-managed)
Use device_input() and device_output() with a CUdeviceptr or void* argument to supply an externally allocated device pointer. The executor will use this pointer directly instead of allocating its own buffer. The connected port on the other operator will also map to this pointer.
Integer literals (e.g. 0) always resolve to the memory block size overload, not the device pointer overload. The device pointer overload is only selected when the argument type is explicitly CUdeviceptr or void*.
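The overload rule above can be reproduced in plain C++. The sketch below shows one possible dispatch scheme and is not taken from the SDK sources: `declare_port`, `PortMemory`, and `DevicePtr` are hypothetical names, and `DevicePtr` is assumed to be `unsigned long long`, standing in for CUdeviceptr on a 64-bit platform.

```cpp
#include <cassert>
#include <type_traits>

using DevicePtr = unsigned long long;  // assumed stand-in for CUdeviceptr

enum class PortMemory { kBlockSize, kDevicePointer };

// Pointer overload. The literal 0 could convert to void* as a null pointer
// constant, but the template below is an exact match for int and wins.
constexpr PortMemory declare_port(void* /*ptr*/) {
  return PortMemory::kDevicePointer;
}

// Integral overload: only a value whose type is exactly DevicePtr is treated
// as a device pointer; every other integer type is treated as a block size.
template <typename T, std::enable_if_t<std::is_integral_v<T>, int> = 0>
constexpr PortMemory declare_port(T /*value*/) {
  if constexpr (std::is_same_v<T, DevicePtr>) {
    return PortMemory::kDevicePointer;
  } else {
    return PortMemory::kBlockSize;
  }
}

// Unsuffixed integer literals have type int, so they select the block size path.
static_assert(declare_port(0) == PortMemory::kBlockSize);
static_assert(declare_port(4096) == PortMemory::kBlockSize);
// An explicitly typed device pointer selects the pointer path.
static_assert(declare_port(DevicePtr{0x1000}) == PortMemory::kDevicePointer);

// Usage: an explicit void* also selects the pointer path.
inline void declare_port_example() {
  int host_stand_in = 0;  // host object standing in for device memory
  assert(declare_port(static_cast<void*>(&host_stand_in)) ==
         PortMemory::kDevicePointer);
}
```

The practical consequence matches the note: to pass a device pointer, give the argument an explicit CUdeviceptr or void* type rather than relying on a bare integer value.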
When two operators are connected, the executor decides how to set up shared device memory for each port pair independently based on what each port declares:
cuda_stream()
Returns the CUDA stream for launching kernels in the operator’s compute() method.
device_memory(port_name)
Returns the device memory address for a given input or output port. Use this to access pre-allocated buffers for kernel launches.
data_ready_handler_cuda_stream()
Returns the CUDA stream for data ready handler operations.
data_ready_device_address()
Returns the device memory pointer for the data ready signal. CUDA kernels in the data ready handler can use this address to signal that data is ready for processing. See holoscan/core/executors/gpu_resident/gpu_resident_dev.cuh for CUDA device functions that take this address, such as gpu_resident_mark_data_ready_dev() and gpu_resident_mark_data_not_ready_dev().
When the size of an input or output port is not known at setup() time, the size can be set to zero (a warning will be logged). The initialize() method can then be used to set the final size of the port.
Access GPU-resident functionality through the Fragment::gpu_resident() accessor.
tear_down()
Sends a tear-down signal to stop the GPU-resident graph. Tearing down the GPU-resident CUDA graph can take some time; poll is_launched() to find out when the graph has been torn down. Note: If timeout_ms is set to a non-zero value, the application is torn down automatically after the timeout duration.
is_launched()
Returns true if the GPU-resident CUDA graph has been launched and is running, false otherwise. Use this to wait for initialization to complete before sending data. After the graph has been torn down, this function returns false.
data_ready()
Signals that input data is ready for processing. Call this after writing data to the application’s input device memory. This could be the device memory allocated to the source operator of an application pipeline.
result_ready()
Returns true if the current iteration’s results are ready for consumption, false otherwise. Poll this after calling data_ready() to know when to read output data.
The data_ready, result_ready and other such CPU-side control methods can affect the deterministic performance of the GPU-resident CUDA graph and should be used with caution.
timeout_ms(timeout)
Sets the timeout for the GPU-resident graph in milliseconds. The GPU-resident graph is torn down after the timeout duration. If no timeout is set, or it is set to 0, the graph runs indefinitely until tear_down() is called.
data_not_ready_sleep_interval_us(sleep_interval_us)
Sets the sleep interval on the GPU device when data is not ready. The GPU-resident graph loop will sleep for this duration (in microseconds) before checking the data ready signal again. This helps reduce unnecessary GPU polling and power consumption when waiting for new data. Default is 500 microseconds. Lower values provide faster response to data ready signals but increase GPU and power usage, while higher values reduce GPU usage but may introduce increased latency.
Important: This setting must be configured before calling run_async() as it cannot be changed after the CUDA graph has been launched.
sync_with_host(enable)
Enables or disables a system-wide memory fence at the end of each GPU-resident iteration. When enabled, the GPU issues a system-wide fence (__threadfence_system()) after the workload completes and before signaling result-ready. This ensures that all device memory writes are globally visible to the host before the result-ready flag is observed.
This option is intended for scenarios where the host controls the GPU-resident graph loop and reads back results between iterations (e.g., via cudaMemcpy). It is recommended for debugging, development, and testing purposes.
Enabling sync_with_host adds latency to each iteration and is not recommended for performance-critical workloads. When the GPU-resident pipeline is driven entirely by GPU-side data ready handlers (no host-side readback between iterations), this option is neither required nor recommended.
Important: This setting must be configured before calling run_async() as it cannot be changed after the CUDA graph has been launched.
register_data_ready_handler(fragment)
Registers a data ready handler fragment that executes at the beginning of each iteration to determine if data is ready.
A GPU-resident fragment is initialized in the following steps:
Graph Topology Verification: The framework verifies that the operator graph is a supported topology: a DAG with exactly one source operator.
Device Memory Setup: For each connection between operators:
CUDA Graph Capture:
- Each operator's compute() method is executed during capture
GPU-Resident Graph Construction:
In the execution phase, there is no CPU-driven graph execution unless explicitly requested by the host CPU process. During asynchronous execution:
- Operators access the pre-allocated buffer for each port through device_memory()
- Operators can supply their own device memory by calling device_input()/device_output() with a CUdeviceptr or void*, giving full control over allocation strategy while still participating in the GPU-resident data flow
Operators are executed in topological order based on the dataflow graph. The framework:
CUDA Device 0 is used for GPU-resident graph execution by default.
Some of the topology limitations are temporary and will be relaxed in future releases.
Fully working examples demonstrating GPU-resident graph execution are available at:
public/examples/gpu_resident_example/gpu_resident_example.cpp
public/examples/gpu_resident_input/gpu_resident_input.cpp
public/examples/gpu_resident_multi_io/gpu_resident_multi_io.cpp — operators with multiple input/output ports
public/examples/gpu_resident_inference_example/ — GPUResidentInferenceOp with TensorRT
- Use cuda_stream() for kernel launches in the main workload and data_ready_handler_cuda_stream() for kernel launches in the data ready handler.
- Tune data_not_ready_sleep_interval_us() based on your application needs:
- Call tear_down() before application exit
If is_launched() never returns true:
- Check the operators' compute() methods
- Verify that all operators inherit from GPUResidentOperator
If result_ready() never returns true:
- Verify that data_ready() was called after writing input data
If you encounter CUDA memory errors:
If execution is slower than expected: