NVIDIA Docs Hub NVIDIA Holoscan NVIDIA Holoscan SDK v3.4.0 Class GPUResourceMonitor

Class GPUResourceMonitor

Defined in File gpu_resource_monitor.hpp

Class Documentation

class GPUResourceMonitor

This class is responsible for monitoring the GPU resources. It provides the information about the GPU resources (through holoscan::GPUInfo) to the SystemResourceManager class.

The following holoscan::GPUMetricFlag flags are supported:

DEFAULT: Default GPU metrics (GPU_DEVICE_ID)
GPU_DEVICE_ID: GPU device ID (name, pci, serial, uuid)
GPU_UTILIZATION: GPU utilization (gpu_utilization, memory_utilization)
MEMORY_USAGE: GPU memory usage (memory_total, memory_free, memory_used, memory_usage)
POWER_LIMIT: GPU power limit (power_limit)
POWER_USAGE: GPU power usage (power_usage)
TEMPERATURE: GPU temperature (temperature)
ALL: All GPU metrics above

index information is always available.

This uses the NVML library to get the GPU information. If NVML library is not available (in case of iGPU), this class uses the CUDA Runtime API to get the GPU information.

The following information is not available when using the CUDA Runtime API:

GPU_DEVICE_ID: pci.pciDeviceId and pci.pciSubSystemId are not available
GPU_UTILIZATION: gpu_utilization and memory_utilization are not available
POWER_LIMIT: power_limit is not available
POWER_USAGE: power_usage is not available
TEMPERATURE: temperature is not available

Example:

Copy
Copied!

            
            #include <holoscan/core/system/system_resource_manager.hpp>
#include <holoscan/logger/logger.hpp>

...

holoscan::GPUResourceMonitor gpu_resource_monitor;
gpu_resource_monitor.update(holoscan::GPUMetricFlag::ALL);
auto gpu_info = gpu_resource_monitor.gpu_info();
auto gpu_count = gpu_resource_monitor.num_gpus();

for (int i = 0; i < gpu_count; i++) {
  // Print GPU information (GPUInfo)
  HOLOSCAN_LOG_INFO("GPU {} is available", gpu_info[i].index);
  HOLOSCAN_LOG_INFO("GPU {} name: {}", i, gpu_info[i].name);
  HOLOSCAN_LOG_INFO("GPU {} is iGPU: {}", i, gpu_info[i].is_integrated);
  HOLOSCAN_LOG_INFO("GPU {} pci.busId: {}", i, gpu_info[i].pci.busId);
  HOLOSCAN_LOG_INFO("GPU {} pci.busIdLegacy: {}", i, gpu_info[i].pci.busIdLegacy);
  HOLOSCAN_LOG_INFO("GPU {} pci.domain: {}", i, gpu_info[i].pci.domain);
  HOLOSCAN_LOG_INFO("GPU {} pci.bus: {}", i, gpu_info[i].pci.bus);
  HOLOSCAN_LOG_INFO("GPU {} pci.device: {}", i, gpu_info[i].pci.device);
  HOLOSCAN_LOG_INFO("GPU {} pci.pciDeviceId: {:x}:{:x}",
                    i,
                    gpu_info[i].pci.pciDeviceId & 0xffff,
                    gpu_info[i].pci.pciDeviceId >> 16);
  HOLOSCAN_LOG_INFO("GPU {} pci.pciSubSystemId: {:x}:{:x}",
                    i,
                    gpu_info[i].pci.pciSubSystemId & 0xffff,
                    gpu_info[i].pci.pciSubSystemId >> 16);
  HOLOSCAN_LOG_INFO("GPU {} serial: {}", i, gpu_info[i].serial);
  HOLOSCAN_LOG_INFO("GPU {} uuid: {}", i, gpu_info[i].uuid);
  HOLOSCAN_LOG_INFO("GPU {} gpu_utilization: {}", i, gpu_info[i].gpu_utilization);
  HOLOSCAN_LOG_INFO("GPU {} memory_utilization: {}", i, gpu_info[i].memory_utilization);
  HOLOSCAN_LOG_INFO("GPU {} memory_total: {}", i, gpu_info[i].memory_total);
  HOLOSCAN_LOG_INFO("GPU {} memory_free: {}", i, gpu_info[i].memory_free);
  HOLOSCAN_LOG_INFO("GPU {} memory_used: {}", i, gpu_info[i].memory_used);
  HOLOSCAN_LOG_INFO("GPU {} memory_usage: {}", i, gpu_info[i].memory_usage);
  HOLOSCAN_LOG_INFO("GPU {} power_limit: {}", i, gpu_info[i].power_limit);
  HOLOSCAN_LOG_INFO("GPU {} power_usage: {}", i, gpu_info[i].power_usage);
  HOLOSCAN_LOG_INFO("GPU {} temperature: {}", i, gpu_info[i].temperature);
}

Public Functions

explicit GPUResourceMonitor(uint64_t metric_flags = kDefaultGpuMetrics)

Construct a new GPUResourceMonitor object.

This constructor creates a new GPUResourceMonitor object.

Parameters: metric_flags – The metric flags (default: GPU_DEVICE_ID)

virtual ~GPUResourceMonitor()

void init(): Initialize the GPU resource monitor.

void close()

Close handle of the GPU resource monitor.

This function closes the handle of the opened NVML and CUDA Runtime libraries if they are open.

uint64_t metric_flags() const

Get metric flags.

This function returns the metric flags.

Returns: The metric flags.

void metric_flags(uint64_t metric_flags)

Set metric flags.

This function sets the metric flags.

Parameters: metric_flags – The metric flags

GPUInfo update(uint32_t index, uint64_t metric_flags = GPUMetricFlag::DEFAULT)

Update the GPU information and cache it.

This function updates information for the GPU with the given index based on the given metric flags and returns the GPU information. If the metric flags are not provided, the existing metric flags are used. It also caches the GPU information.

Parameters

index – The GPU index.
metric_flags – The metric flags.

Returns

The GPU information.

std::vector<GPUInfo> update(uint64_t metric_flags = GPUMetricFlag::DEFAULT)

Update all GPU information and cache it.

This function updates the information for all GPUs based on the given metric flags and returns a vector of GPU information. If the metric flags are not provided, the existing metric flags are used. It also caches the GPU information.

Parameters: metric_flags – The metric flags.
Returns: The vector of GPU information.

GPUInfo &update(uint32_t index, GPUInfo &gpu_info, uint64_t metric_flags = GPUMetricFlag::DEFAULT)

Update the GPU information.

This function fills the GPU information given as the argument based on the given metric flags and returns the GPU information. If the metric flags are not provided, the existing metric flags are used.

Parameters

index – The GPU index.
gpu_info – The GPU information.
metric_flags – The metric flags.

Returns

The GPU information filled with the updated values (same as the argument).

GPUInfo gpu_info(uint32_t index, uint64_t metric_flags = GPUMetricFlag::DEFAULT)

Get the GPU information.

This method returns the GPU information based on the given index.

If the metric flags are provided, it returns the vector of GPU information based on the given metric flags. If the metric flags are not provided, it returns the cached GPU information.

Parameters

index – The GPU index.
metric_flags – The metric flags.

Returns

The GPU information.

std::vector<GPUInfo> gpu_info(uint64_t metric_flags = GPUMetricFlag::DEFAULT)

Get all GPU information.

This method returns the vector of GPU information. If the metric flags are provided, it returns the GPU information based on the given metric flags. If the metric flags are not provided, it returns the cached GPU information.

Parameters: metric_flags – The metric flags.
Returns: All GPU information.

uint32_t num_gpus() const

Get the number of GPUs.

Returns: The number of GPUs.

bool is_integrated_gpu(uint32_t index)

Check whether the GPU is integrated (iGPU)

Returns: True if the GPU is integrated (iGPU), false otherwise.

Protected Functions

bool bind_nvml_methods()

bool bind_cuda_runtime_methods()

bool init_nvml()

bool init_cuda_runtime()

void shutdown_nvml() noexcept

void shutdown_cuda_runtime() noexcept

Protected Attributes

void *handle_ = nullptr: The handle of the GPU resource monitor.

void *cuda_handle_ = nullptr: The handle of the CUDA Runtime library.

nvml::nvmlErrorString_t nvmlErrorString = nullptr: The function pointer to the nvmlErrorString function.

nvml::nvmlInit_t nvmlInit = nullptr: The function pointer to the nvmlInit function.

nvml::nvmlDeviceGetCount_t nvmlDeviceGetCount = nullptr: The function pointer to the nvmlDeviceGetCount function.

nvml::nvmlDeviceGetHandleByIndex_t nvmlDeviceGetHandleByIndex = nullptr: The function pointer to the nvmlDeviceGetHandleByIndex function.

nvml::nvmlDeviceGetHandleByPciBusId_t nvmlDeviceGetHandleByPciBusId = nullptr: The function pointer to the nvmlDeviceGetHandleByPciBusId function.

nvml::nvmlDeviceGetHandleBySerial_t nvmlDeviceGetHandleBySerial = nullptr: The function pointer to the nvmlDeviceGetHandleBySerial function.

nvml::nvmlDeviceGetHandleByUUID_t nvmlDeviceGetHandleByUUID = nullptr: The function pointer to the nvmlDeviceGetHandleByUUID function.

nvml::nvmlDeviceGetName_t nvmlDeviceGetName = nullptr: The function pointer to the nvmlDeviceGetName function.

nvml::nvmlDeviceGetIndex_t nvmlDeviceGetIndex = nullptr: The function pointer to the nvmlDeviceGetIndex function.

nvml::nvmlDeviceGetPciInfo_t nvmlDeviceGetPciInfo = nullptr: The function pointer to the nvmlDeviceGetPciInfo function.

nvml::nvmlDeviceGetSerial_t nvmlDeviceGetSerial = nullptr: The function pointer to the nvmlDeviceGetSerial function.

nvml::nvmlDeviceGetUUID_t nvmlDeviceGetUUID = nullptr: The function pointer to the nvmlDeviceGetUUID function.

nvml::nvmlDeviceGetMemoryInfo_t nvmlDeviceGetMemoryInfo = nullptr: The function pointer to the nvmlDeviceGetMemoryInfo function.

nvml::nvmlDeviceGetUtilizationRates_t nvmlDeviceGetUtilizationRates = nullptr: The function pointer to the nvmlDeviceGetUtilizationRates function.

nvml::nvmlDeviceGetPowerManagementLimit_t nvmlDeviceGetPowerManagementLimit = nullptr: The function pointer to the nvmlDeviceGetPowerManagementLimit function.

nvml::nvmlDeviceGetPowerUsage_t nvmlDeviceGetPowerUsage = nullptr: The function pointer to the nvmlDeviceGetPowerUsage function.

nvml::nvmlDeviceGetTemperature_t nvmlDeviceGetTemperature = nullptr: The function pointer to the nvmlDeviceGetTemperature function.

nvml::nvmlShutdown_t nvmlShutdown = nullptr: The function pointer to the nvmlShutdown function.

cuda::cudaGetErrorString_t cudaGetErrorString = nullptr: The function pointer to the cudaGetErrorString function.

cuda::cudaGetDeviceCount_t cudaGetDeviceCount = nullptr: The function pointer to the cudaGetDeviceCount function.

cuda::cudaGetDeviceProperties_t cudaGetDeviceProperties = nullptr: The function pointer to the cudaGetDeviceProperties function.

cuda::cudaDeviceGetPCIBusId_t cudaDeviceGetPCIBusId = nullptr: The function pointer to the cudaDeviceGetPCIBusId function.

cuda::cudaMemGetInfo_t cudaMemGetInfo = nullptr: The function pointer to the cudaMemGetInfo function.

uint64_t metric_flags_ = kDefaultGpuMetrics: The metric flags.

bool is_cached_ = false: The flag to indicate whether the GPU information is cached.

uint32_t gpu_count_ = 0: The cached number of GPUs.

std::vector<GPUInfo> gpu_info_: The cached GPU information.

std::vector<nvml::nvmlDevice_t> nvml_devices_: The cached NVML devices.

Previous Class GPUDevice

Next Template Class Graph