Class Fragment::GPUResidentAccessor
Defined in File fragment.hpp
This class is a nested type of Class Fragment.
-
class GPUResidentAccessor
Accessor class for GPU-resident specific functions of a Fragment.
This class provides a convenient interface for accessing GPU-resident specific functionality of a Fragment. It acts as a mediator that exposes GPU-resident operations through a cleaner API pattern: fragment->gpu_resident().function().
This is a lightweight accessor class that maintains a reference to the parent Fragment.
Public Functions
-
GPUResidentAccessor() = delete
-
inline explicit GPUResidentAccessor(Fragment *fragment)
Construct a new GPUResidentAccessor object.
- Parameters
fragment – Pointer to the parent Fragment
-
void timeout_ms(unsigned long long timeout_ms)
Set the timeout for GPU-resident execution.
GPU-resident execution occurs asynchronously. This sets the timeout so that execution is stopped after it exceeds the specified duration.
- Parameters
timeout_ms – The timeout in milliseconds.
-
void tear_down()
Send a tear down signal to the GPU-resident CUDA graph.
Currently, the timeout must be set to zero for the tear-down signal to take effect.
-
bool result_ready()
Check if the result of a single iteration of the GPU-resident CUDA graph is ready.
- Returns
true if the result is ready, false otherwise.
-
void data_ready()
Inform the GPU-resident CUDA graph that the data is ready for the main workload.
-
bool is_launched()
Check if the GPU-resident CUDA graph has been launched.
- Returns
true if the CUDA graph has been launched, false otherwise.
-
cudaGraph_t workload_graph()
Get the CUDA graph of the main workload in this fragment. This returns a clone of the main workload graph; note that certain CUDA graph node types (e.g., memory allocation, memory free, and conditional nodes) cannot be cloned.
- Returns
A clone of the CUDA graph of the main workload in this fragment.
-
void *data_ready_device_address()
Get the CUDA device pointer for the data_ready signal.
This returns the actual device memory address that the GPU-resident CUDA graph uses to check if data is ready for processing. Can be used for advanced GPU-resident applications that need direct access to these control signals.
- Returns
Pointer to the device memory location for data_ready signal.
-
void *result_ready_device_address()
Get the CUDA device pointer for the result_ready signal.
Similar to data_ready_device_address(), but for the result_ready signal.
- Returns
Pointer to the device memory location for result_ready signal.
-
void *tear_down_device_address()
Get the CUDA device pointer for the tear_down signal.
Similar to data_ready_device_address(), but for the tear_down signal.
- Returns
Pointer to the device memory location for tear_down signal.
-
void data_ready_handler_fragment(std::shared_ptr<Fragment> data_ready_handler_fragment)
Register a data ready handler to this fragment.
The data ready handler is executed at the beginning of every iteration of the GPU-resident CUDA graph. It typically indicates whether input data is ready for processing. If the handler marks the data as ready, the main workload CUDA graph is executed on that iteration; otherwise, main workload processing is skipped for that iteration.
- Parameters
data_ready_handler_fragment – Shared pointer to a fragment that will be added as the data ready handler to this fragment.
-
std::shared_ptr<Fragment> data_ready_handler_fragment()
Get the registered data ready handler fragment.
- Returns
The data ready handler fragment, or nullptr if none is registered.
-
void data_not_ready_sleep_interval_us(unsigned int sleep_interval_us = 500)
Set the sleep interval on the device when data is not ready. In each iteration of the GPU-resident execution loop, the device sleeps for the specified interval if the data is not ready. This helps save energy and reduce heat generation.
- Parameters
sleep_interval_us – the sleep interval in microseconds. Default is 500 us.
-
void sync_with_host(bool enable = true)
Enable or disable a system-wide memory fence at the end of each GPU-resident iteration.
When enabled, the GPU issues a system-wide fence (__threadfence_system()) after the workload completes and before signaling result-ready. This ensures that all device memory writes are globally visible to the host before the result-ready flag is observed.
This option is intended for scenarios where the host controls the GPU-resident execution loop and reads back results between iterations (e.g., via cudaMemcpy). It is recommended for debugging, development, and testing purposes.
Must be called before the GPU-resident CUDA graph is launched.
Note: Enabling this adds latency to each iteration and is not recommended for performance-critical workloads.
- Parameters
enable – true to enable the system-wide fence, false to disable it. The fence is disabled unless this function is called.
-
void enable_perf_measurement(unsigned int num_samples = 100)
Enable execution time measurement. Execution time is the time between the start of a streaming data iteration and the end of the same iteration. Execution time is not measured when the data is not marked as ready.
Note that GPU-resident execution can continue for more iterations than the number of samples to collect; execution times are recorded only for the specified number of samples. Because the execution times are stored in device memory, they cannot be collected for an unbounded number of iterations.
- Parameters
num_samples – the total number of samples to collect. Default is 100.
-
void save_perf_results_as_csv(const std::string &filename = "gpu_resident_perf.csv")
Saves the execution times in microseconds as a CSV file.
- Parameters
filename – the name of the CSV file in which to save the execution times (in microseconds).
-
void print_perf_metrics(unsigned int skip_first = 10, unsigned int skip_last = 10)
Prints key metrics about the measured execution times.
- Parameters
skip_first – the number of samples to skip at the beginning. Default is 10.
skip_last – the number of samples to skip at the end. Default is 10.