Holoscan SDK GPU-Resident Graphs
Holoscan SDK’s GPU-resident graphs enable deterministic, real-time and low-latency execution of Holoscan applications by keeping the (CUDA) compute pipeline on the GPU for the lifetime of an application. GPU-resident graph execution mode allows an application to be operated entirely on the GPU, without (or with minimal) involvement from the CPU. This mode eliminates scheduling, synchronization, orchestration and systems overheads on the CPU, making GPU applications more predictable. Unlike traditional CPU-driven scheduling and execution, GPU-resident graphs leverage CUDA Graphs to capture and replay an entire Holoscan application directly on the GPU, eliminating costly CPU-GPU coordination and synchronization overhead.
For sensor inputs and actuation outputs, Holoscan SDK GPU-resident graphs are combined with devices and mechanisms using GPU-direct RDMA technologies such as Holoscan Sensor Bridge, DOCA GPUNetIO. Holoscan SDK GPU-resident graphs also support low-latency, highly responsive and predictable visualization outputs, especially on NVIDIA G-SYNC supported monitors.
The GPU-resident graphs do not follow the traditional Holoscan SDK execution workflow and cannot be interconnected with a traditional Holoscan SDK fragment and operator. It is a standalone and unique execution model, supported by a separate Holoscan SDK executor backend.
GPU-resident graphs are only supported in C++. Python support is planned for the future.
Key Benefits
Deterministic Performance: CUDA Graphs provide consistent and predictable execution time and end-to-end latency by eliminating scheduling, coordination and operating system overheads on the CPU.
Reduced Latency: Eliminates CPU-GPU synchronization and CPU-driven GPU workload scheduling overheads by executing entirely on the GPU
Fast and Low-latency I/O: GPU-direct RDMA combined with GPU-resident graphs in Holoscan SDK eliminates costly and unpredictable I/O interrupt and management overheads on the CPU, enabling fast and low-latency I/O for the GPU-resident applications.
Scalability: CPU-dependent workloads in embedded systems are limited by the small number of CPU cores available to scale applications for a multitude of sensors and actuators for robotics, medical, and other real-time and embedded applications. In contrast, the exponentially higher number of CUDA cores on the GPU allows embedded applications to easily scale. GPU-resident graphs take advantage of this scalability of the GPUs to support a huge numbers of sensors and actuators.
Efficient Resource Usage: Pre-allocated device memory and optimized data flow
When to Use GPU-Resident Graphs
GPU-resident graphs are ideal for:
Real-time applications where predictable and consistent execution timing is critical
Applications with CUDA-only operators
Applications that either have no CPU-based process dependencies or where a CPU-based Holoscan fragment can work asynchronously with a GPU-resident (CUDA-only) Holoscan fragment
GPU-Resident Operators
GPU-resident operators inherit from holoscan::GPUResidentOperator instead of the standard holoscan::Operator class. These operators:
Execute entirely on the GPU using CUDA kernels
Use device memory for input/output ports
Are captured into CUDA Graphs
GPU-Resident Fragments
A GPU-resident Fragment is created by composing GPU-resident operators. The framework automatically detects that a holoscan::Fragment should use GPU-resident graphs when all operators in the fragment inherit from holoscan::GPUResidentOperator. Currently, only linear chains of operators are supported. A DAG of operators will be supported in the future.
To create a GPU-resident Fragment:
Create operators that inherit from
holoscan::GPUResidentOperatorCompose them in a standard Fragment using
make_operator<>()andadd_flow()The framework will automatically enable GPU-resident graphs during initialization
class MyGpuResidentFragment : public holoscan::Fragment {
public:
void compose() override {
// Create GPU-resident operators
auto source = make_operator<SourceGpuOp>("source");
auto compute = make_operator<ComputeGpuOp>("compute");
auto sink = make_operator<SinkGpuOp>("sink");
// Connect them in a linear chain
add_flow(source, compute);
add_flow(compute, sink);
}
};
// Holoscan SDK will automatically detect that MyGpuResidentFragment should use GPU-resident graphs
Access GPU-resident controls via fragment->gpu_resident() after calling run_async(). More details on the API reference section.
There can only be one GPU-resident fragment per Holoscan SDK application. Other traditional Holoscan SDK fragments can co-exist with a GPU-resident fragment in the same application. However, connections between GPU-resident and traditional Holoscan SDK fragments and operators (with GXF backend) are not supported.
Main Workload Processing GPU-resident Fragment
The GPU-resident fragment is designated as the main workload processing fragment. The main data processing such as image processing, AI model inference happens in this main fragment. There is also an optional data ready handler fragment that can be used to augment the main fragment to automatically handle external sensor data inputs.
Data Ready Handler (Optional)
A data ready handler is an optional feature that allows a GPU-resident fragment to handle external data sources to be handled on the GPU itself. By leveraging this feature, a CUDA kernel, used as a data ready handler, can check whether new data is available for processing, and a GPU-resident data processing pipeline can subsequently be triggered.
A data ready handler is created with a separate GPU-resident fragment that is registered with the main GPU-resident fragment. This handler runs at the beginning of each iteration. It can determine if data is ready for processing. It is important to note that the data ready handler fragment is not a separately executable GPU-resident fragment in a Holoscan SDK application. This fragment is only used as a part of the main workload processing GPU-resident fragment. Holoscan SDK currently only allows one main workload processing GPU-resident fragment per application.
The data ready handler allows for:
Custom data availability checking logic
Conditional execution of the main workload
Integration with external sensor data sources by leveraging GPU-direct technologies such as Holoscan Sensor Bridge.
This feature allows integration of GPU-direct technologies into the Holoscan SDK GPU-resident graph execution mode. By leveraging this feature, developers build sensor data-driven GPU-resident pipelines. As sensor data arrives in the GPU, without any involvement from the CPU, the data ready handler will detect it and trigger the main GPU-resident (data processing) workload.
CUDA Graph Backend
When a GPU-resident fragment is initialized, the framework:
Allocates device memory for all inter-operator connections based on port specifications
Captures operator execution into CUDA Graphs by recording the
compute()methodCreates a GPU-resident graph execution (that optionally includes control flow for data ready checking)
The GPU-resident CUDA graph is launched once from the host CPU process. The graph keeps running on the GPU until the application terminates. For every new (sensor) data inputs, the GPU-resident CUDA graph processes the data without ever leaving the GPU. In the absence of any intervention from the host CPU process, the GPU-resident graphs maintain deterministic execution timing for sensor data processing.
Graph Execution Flow
Application calls
run_async()to start GPU-resident graphsGPU-resident CUDA graph is launched asynchronously
For each iteration:
Data ready handler (if present) checks if input data is available
If the data is ready, main workload operators execute in sequence
Result ready signal is set when processing completes
Host CPU-driven GPU-resident Graphs
Optionally, the host CPU can also control the GPU-resident graphs by the following steps:
Write input data to device memory
Call
data_ready()to trigger processingCheck
result_ready()to know when results are availableRead output data from device memory until the
data_ready()signal is set again
This is useful for debugging, development, testing and cases where GPU-direct technologies are not available/yet integrated.
When using host CPU-driven graphs with functions like cudaMemcpy to read back results between iterations, enable sync_with_host() before launching the graph to guarantee that all device memory writes are visible to the host before result_ready() returns true. See sync_with_host for details.
GPUResidentOperator
The base class for GPU-resident operators is holoscan::GPUResidentOperator. Inherit from this class to create operators that execute in GPU-resident mode.
Port Declaration
Ports can be declared in two ways: by memory block size (the executor allocates shared device memory) or by device pointer (the operator provides its own pre-allocated device memory).
Memory block size (executor-allocated)
class MyGpuOp : public holoscan::GPUResidentOperator {
void setup(OperatorSpec& spec) override {
// Declare device input port with memory size (in bytes)
spec.device_input("in", sizeof(float) * num_elements);
// Declare device output port with memory size (in bytes)
spec.device_output("out", sizeof(float) * num_elements);
}
};
Use device_input() and device_output() with a size_t or integer literal to declare ports with executor-allocated device memory. The executor will allocate a shared device buffer for each connection. Connected ports map to the same device memory address.
Device pointer (operator-managed)
class MyCustomAllocGpuOp : public holoscan::GPUResidentOperator {
void setup(OperatorSpec& spec) override {
// Allocate device memory externally
cudaMalloc(&my_input_ptr_, sizeof(float) * num_elements);
cudaMalloc(&my_output_ptr_, sizeof(float) * num_elements);
// Declare ports with externally managed device pointers
spec.device_input("in", reinterpret_cast<CUdeviceptr>(my_input_ptr_));
spec.device_output("out", reinterpret_cast<CUdeviceptr>(my_output_ptr_));
}
private:
void* my_input_ptr_ = nullptr;
void* my_output_ptr_ = nullptr;
};
Use device_input() and device_output() with a CUdeviceptr or void* argument to supply an externally allocated device pointer. The executor will use this pointer directly instead of allocating its own buffer. The connected port on the other operator will also map to this pointer.
Integer literals (e.g. 0) always resolve to the memory block size overload, not the device pointer overload. The device pointer overload is only selected when the argument type is explicitly CUdeviceptr or void*.
Connection Strategy
When two operators are connected, the executor decides how to set up shared device memory based on what each port declares:
Both ports declare a memory block size – the executor allocates a shared buffer (sizes must match).
One port declares a device pointer – the executor uses that pointer for both ports. If the other port has a memory block size, it is ignored (with a warning).
Both ports declare a device pointer – the source’s pointer is used (an error is logged).
Only one port declares a memory block size (the other has neither) – the executor allocates using the available size (with a warning).
Neither port declares anything – a runtime error is thrown.
Important Helper Methods
cuda_stream()
std::shared_ptr<cudaStream_t> cuda_stream();
Returns the CUDA stream for launching kernels in the operator’s compute() method.
device_memory(port_name)
void* device_memory(const std::string& port_name);
Returns the device memory address for a given input or output port. Use this to access pre-allocated buffers for kernel launches.
data_ready_handler_cuda_stream()
std::shared_ptr<cudaStream_t> data_ready_handler_cuda_stream();
Returns the CUDA stream for data ready handler operations.
data_ready_device_address()
void* data_ready_device_address();
Returns the device memory pointer for the data ready signal. This address can be used in data ready handler’s CUDA kernels to signal that data is ready for processing. See holoscan/core/executors/gpu_resident/gpu_resident_dev.cuh for CUDA device functions like gpu_resident_mark_data_ready_dev() and gpu_resident_mark_data_not_ready_dev() where this address can be used.
Example Operator Implementation
class ComputeGpuOp : public holoscan::GPUResidentOperator {
public:
HOLOSCAN_OPERATOR_FORWARD_ARGS_SUPER(ComputeGpuOp, holoscan::GPUResidentOperator)
void setup(OperatorSpec& spec) override {
spec.device_input("in", sizeof(float) * 1024);
spec.device_output("out", sizeof(float) * 1024);
}
void compute(holoscan::InputContext& op_input,
holoscan::OutputContext& op_output,
holoscan::ExecutionContext& context) override {
// Get device memory addresses
auto* input_addr = device_memory("in");
auto* output_addr = device_memory("out");
// Get CUDA stream for kernel launch
auto stream_ptr = cuda_stream();
cudaStream_t stream = *stream_ptr;
// Launch CUDA kernel
my_kernel<<<grid, block, 0, stream>>>(
static_cast<float*>(input_addr),
static_cast<float*>(output_addr),
1024);
}
};
When the size of an input or output port is not known at setup() time, the size can be set to zero (a warning will be logged). Later, initialize() method can be used to set the final size of the port.
Fragment GPU-Resident API
Access GPU-resident functionality through the Fragment::gpu_resident() accessor.
auto fragment = make_fragment<MyGpuResidentFragment>();
// Configure before launching (optional)
fragment->gpu_resident().timeout_ms(5000); // Set timeout to 5 seconds
fragment->gpu_resident().data_not_ready_sleep_interval_us(250); // Sleep 250us when waiting for data
// Launch the fragment
auto future = fragment->run_async();
// Control during graph execution
fragment->gpu_resident().data_ready(); // Signal data is ready
fragment->gpu_resident().tear_down(); // Tear down GPU-resident fragment (terminates the GPU-resident CUDA graph)
Control and Status Checking Methods from Host (CPU) Process
tear_down()
void tear_down();
Sends a tear down signal to stop GPU-resident graph. It can take some time to tear down the GPU-resident CUDA graph. Check with is_launched() function to know if the graph has been torn down. Note: If timeout_ms is set to non-zero value, then the application will automatically be torn down after the timeout duration.
is_launched()
bool is_launched();
Returns true if the GPU-resident CUDA graph has been launched and is running, false otherwise. Use this to wait for initialization to complete before sending data. If the graph has been torn down, this function will return false in that case.
data_ready()
void data_ready();
Signals that input data is ready for processing. Call this after writing data to the application’s input device memory. This could be the device memory allocated to the source operator of an application pipeline.
result_ready()
bool result_ready();
Returns true if the current iteration’s results are ready for consumption, false otherwise. Poll this after calling data_ready() to know when to read output data.
The data_ready, result_ready and other such CPU-side control methods can affect the deterministic performance of the GPU-resident CUDA graph and should be used with caution.
Configuration Methods
timeout_ms(timeout)
void timeout_ms(unsigned long long timeout_ms);
Sets the timeout for GPU-resident graph in milliseconds. GPU-resident graph will be torn down after the timeout duration. If nothing is set or set to 0, then the graph will run indefinitely until tear_down() is called.
data_not_ready_sleep_interval_us(sleep_interval_us)
void data_not_ready_sleep_interval_us(unsigned int sleep_interval_us = 500);
Sets the sleep interval on the GPU device when data is not ready. The GPU-resident graph loop will sleep for this duration (in microseconds) before checking the data ready signal again. This helps reduce unnecessary GPU polling and power consumption when waiting for new data. Default is 500 microseconds. Lower values provide faster response to data ready signals but increase GPU and power usage, while higher values reduce GPU usage but may introduce increased latency.
Important: This setting must be configured before calling run_async() as it cannot be changed after the CUDA graph has been launched.
sync_with_host(enable)
void sync_with_host(bool enable = true);
Enables or disables a system-wide memory fence at the end of each GPU-resident iteration. When enabled, the GPU issues a system-wide fence (__threadfence_system()) after the workload completes and before signaling result-ready. This ensures that all device memory writes are globally visible to the host before the result-ready flag is observed.
This option is intended for scenarios where the host controls the GPU-resident graph loop and reads back results between iterations (e.g., via cudaMemcpy). It is recommended for debugging, development, and testing purposes.
Enabling sync_with_host adds latency to each iteration and is not recommended for performance-critical workloads. When the GPU-resident pipeline is driven entirely by GPU-side data ready handlers (no host-side readback between iterations), this option is not recommended and not required.
Important: This setting must be configured before calling run_async() as it cannot be changed after the CUDA graph has been launched.
register_data_ready_handler(fragment)
void register_data_ready_handler(std::shared_ptr<Fragment> data_ready_handler_fragment);
Registers a data ready handler fragment that executes at the beginning of each iteration to determine if data is ready.
Initialization Phase
A GPU-resident fragment is initialized in the following steps:
Graph Topology Verification: The framework verifies the operator graph forms a supported topology (only linear chains are currently supported; DAGs will be supported in the future).
Device Memory Setup: For each connection between operators:
If both ports specify a memory block size, the executor allocates a shared device buffer
If a port specifies a device pointer, the executor uses that pointer directly (no allocation)
Memory addresses are mapped to operator ports
Connected operator ports map to same device memory addresses
CUDA Graph Capture:
A CUDA stream is created for graph capture
Each operator’s
compute()method is executed during captureCUDA operations (kernel launches, memcpy, etc.) are recorded into a graph
For data ready handlers, a separate graph is captured
GPU-Resident Graph Construction:
A conditional node checks the data ready signal
If data is ready, the main workload graph executes
A result ready signal is set upon completion
A tear down check determines if graph execution should continue
Graph Execution Phase
In the execution phase, there is no CPU-driven graph execution unless explicitly requested by the host CPU process. During asynchronous execution:
Graph Launch: The GPU-resident CUDA graph is launched on a dedicated stream
Iteration Loop (on the GPU):
The graph polls the data ready signal and tear down signal
If data is not ready, the GPU sleeps for the configured interval (default 500 μs) before checking again
When data ready signal is set, executes all operators
Sets result ready signal when processing is complete
When tear down signal is set, the graph is torn down
Memory Management
Pre-allocated Buffers: All inter-operator buffers are pre-allocated during initialization, either by the executor (memory block size) or by the operator itself (device pointer)
Zero-Copy Access: Operators access device memory directly via
device_memory()Persistent Addresses: Memory addresses remain constant throughout execution
No Dynamic Allocation: Our graph execution mandates static memory layouts
Operator-managed Memory: Operators can supply their own device pointers via
device_input()/device_output()with aCUdeviceptrorvoid*, giving full control over allocation strategy while still participating in the GPU-resident data flow
Topological Ordering
Operators are executed in topological order based on the dataflow graph. The framework:
Determines the correct execution sequence
Ensures dependencies are satisfied before operator execution
Captures operators in the correct order into the CUDA graph
Linear Topology Only: Currently supports only linear chains of operators
Static Memory: All memory must be pre-allocated; dynamic memory allocation not supported
Single Device: Multi-GPU execution not yet supported.
CUDA Device 0is used for GPU-resident graph execution by default.No Scheduler: Cannot use standard Holoscan schedulers to connect with GPU-resident graph execution
Some of the limitations such as supporting linear chain only are temporary and will be eliminated in the future releases.
Fully working examples demonstrating GPU-resident graph execution are available at:
public/examples/gpu_resident_example/gpu_resident_example.cpp
public/examples/gpu_resident_input/gpu_resident_input.cpp
Operator Design
Keep compute() lightweight: Only launch CUDA kernels; avoid CPU work as CPU calls won’t be repeated in the graph execution phase.
Operator granularity: GPU-resident graph execution captures the workflow into a CUDA Graph and replays it on the GPU. This typically has much lower per-operator scheduling overhead (~0.5–2 µs kernel transition latency) than CPU-driven graph execution, making finer-grained operator decomposition more practical. See Performance Considerations for comparison with CPU-based scheduling.
Use provided streams: Always use
cuda_stream()for kernel launches in the main workload anddata_ready_handler_cuda_stream()for kernel launches in the data ready handler.Pre-calculate sizes: Determine buffer sizes at setup time, not runtime
Avoid CPU-driven GPU Controls: Avoid explicit CUDA synchronization and other CPU-driven GPU controls to get the most deterministic performance.
Performance Tuning
Tune sleep interval: Adjust
data_not_ready_sleep_interval_us()based on your application needs:Low latency applications (e.g., high-speed sensors): Use shorter intervals (e.g., 100-250 μs) for faster response to new data
Power-constrained applications: Use longer intervals (e.g., 500-1000 μs) to reduce GPU polling overhead
Default (500 μs): Provides a balanced trade-off between latency and GPU utilization
Compute Requirements: Depending on the rest of the application pipeline’s compute requirements, the sleep interval must be adjusted to ensure expected Quality-of-service and power usage trade-off.
Monitor GPU utilization: Use NVIDIA tools (nvidia-smi, Nsight Systems) to verify GPU usage patterns match expectations
Error Handling
Check is_launched(): Ensure graph is ready before sending data
Handle timeouts: Implement timeout logic when waiting for results
Clean up properly: Always call
tear_down()before application exit
Graph Not Launching
If is_launched() never returns true:
Check for initialization errors in operator
compute()methodsVerify all operators inherit from
GPUResidentOperator
Result Never Ready
If result_ready() never returns true:
Verify
data_ready()was called after writing input dataCheck for kernel errors
Ensure operators are launching kernels correctly
Memory Access Errors
If encountering CUDA memory errors:
Verify buffer sizes match between port declaration and usage
Check that device memory addresses are not null before use
Ensure no out-of-bounds access in kernels
Performance Issues
If execution is slower than expected:
Profile with NVIDIA Nsight Systems to identify bottlenecks
Check for unnecessary CUDA synchronization
Verify kernels are launched with optimal grid/block dimensions