holoscan::GPUResourceMonitor

GPUResourceMonitor class.

This class monitors the GPU resources. It provides information about them (through holoscan::GPUInfo) to the SystemResourceManager class.

The following holoscan::GPUMetricFlag flags are supported:

  • DEFAULT: Default GPU metrics (GPU_DEVICE_ID)
  • GPU_DEVICE_ID: GPU device ID (name, pci, serial, uuid)
  • GPU_UTILIZATION: GPU utilization (gpu_utilization, memory_utilization)
  • MEMORY_USAGE: GPU memory usage (memory_total, memory_free, memory_used, memory_usage)
  • POWER_LIMIT: GPU power limit (power_limit)
  • POWER_USAGE: GPU power usage (power_usage)
  • TEMPERATURE: GPU temperature (temperature)
  • ALL: All GPU metrics above
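Because the metric flags are passed around as a uint64_t value, they can presumably be combined with bitwise OR. The following is a minimal sketch of that idea, not the Holoscan definition: the sketch namespace, the MetricFlag enum, the bit positions, and the has_flag helper are all hypothetical and only mirror the flag names listed above.

```cpp
#include <cstdint>

namespace sketch {

// Hypothetical bit assignments for illustration only. The real values are
// defined by holoscan::GPUMetricFlag and may differ.
enum MetricFlag : std::uint64_t {
  GPU_DEVICE_ID   = 1ULL << 0,
  GPU_UTILIZATION = 1ULL << 1,
  MEMORY_USAGE    = 1ULL << 2,
  POWER_LIMIT     = 1ULL << 3,
  POWER_USAGE     = 1ULL << 4,
  TEMPERATURE     = 1ULL << 5,
  DEFAULT         = GPU_DEVICE_ID,
  ALL             = GPU_DEVICE_ID | GPU_UTILIZATION | MEMORY_USAGE |
                    POWER_LIMIT | POWER_USAGE | TEMPERATURE
};

// Returns true when every bit of `flag` is set in `mask`.
constexpr bool has_flag(std::uint64_t mask, std::uint64_t flag) {
  return (mask & flag) == flag;
}

}  // namespace sketch
```

With this layout, a mask such as MEMORY_USAGE | TEMPERATURE requests only those two groups of fields, and ALL is simply the union of every individual flag.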

The index information is always available.

This class uses the NVML library to get the GPU information. If the NVML library is not available (as is the case on an iGPU), it falls back to the CUDA Runtime API.

The following information is not available when using the CUDA Runtime API:

  • GPU_DEVICE_ID: pci.pciDeviceId and pci.pciSubSystemId are not available
  • GPU_UTILIZATION: gpu_utilization and memory_utilization are not available
  • POWER_LIMIT: power_limit is not available
  • POWER_USAGE: power_usage is not available
  • TEMPERATURE: temperature is not available
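The fallback order described above can be sketched as follows. This is an illustration under stated assumptions, not the class's actual implementation: the Backend enum, select_backend, and has_utilization_power_and_temperature are hypothetical names, and the two callables stand in for whatever init_nvml() and init_cuda_runtime() do internally.

```cpp
#include <functional>

namespace sketch {

// Hypothetical tag for the library that ends up serving the queries.
enum class Backend { Nvml, CudaRuntime, None };

// Try NVML first; fall back to the CUDA Runtime API only if NVML fails.
// The callables are stand-ins for the real initialization routines.
inline Backend select_backend(const std::function<bool()>& init_nvml,
                              const std::function<bool()>& init_cuda_runtime) {
  if (init_nvml()) { return Backend::Nvml; }
  if (init_cuda_runtime()) { return Backend::CudaRuntime; }
  return Backend::None;
}

// Per the list above, utilization, power, and temperature fields are
// reported only when NVML is the backend.
inline bool has_utilization_power_and_temperature(Backend backend) {
  return backend == Backend::Nvml;
}

}  // namespace sketch
```

A caller could use the second helper to decide whether fields such as power_usage or temperature are worth reading at all on a given system.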

#include <holoscan/gpu_resource_monitor.hpp>

Example

#include <holoscan/core/system/system_resource_manager.hpp>
#include <holoscan/logger/logger.hpp>
...
holoscan::GPUResourceMonitor gpu_resource_monitor;
gpu_resource_monitor.update(holoscan::GPUMetricFlag::ALL);
auto gpu_info = gpu_resource_monitor.gpu_info();
auto gpu_count = gpu_resource_monitor.num_gpus();
for (int i = 0; i < gpu_count; i++) {
  // Print GPU information (GPUInfo)
  HOLOSCAN_LOG_INFO("GPU {} is available", gpu_info[i].index);
  HOLOSCAN_LOG_INFO("GPU {} name: {}", i, gpu_info[i].name);
  HOLOSCAN_LOG_INFO("GPU {} is iGPU: {}", i, gpu_info[i].is_integrated);
  HOLOSCAN_LOG_INFO("GPU {} pci.busId: {}", i, gpu_info[i].pci.busId);
  HOLOSCAN_LOG_INFO("GPU {} pci.busIdLegacy: {}", i, gpu_info[i].pci.busIdLegacy);
  HOLOSCAN_LOG_INFO("GPU {} pci.domain: {}", i, gpu_info[i].pci.domain);
  HOLOSCAN_LOG_INFO("GPU {} pci.bus: {}", i, gpu_info[i].pci.bus);
  HOLOSCAN_LOG_INFO("GPU {} pci.device: {}", i, gpu_info[i].pci.device);
  HOLOSCAN_LOG_INFO("GPU {} pci.pciDeviceId: {:x}:{:x}",
                    i,
                    gpu_info[i].pci.pciDeviceId & 0xffff,
                    gpu_info[i].pci.pciDeviceId >> 16);
  HOLOSCAN_LOG_INFO("GPU {} pci.pciSubSystemId: {:x}:{:x}",
                    i,
                    gpu_info[i].pci.pciSubSystemId & 0xffff,
                    gpu_info[i].pci.pciSubSystemId >> 16);
  HOLOSCAN_LOG_INFO("GPU {} serial: {}", i, gpu_info[i].serial);
  HOLOSCAN_LOG_INFO("GPU {} uuid: {}", i, gpu_info[i].uuid);
  HOLOSCAN_LOG_INFO("GPU {} gpu_utilization: {}", i, gpu_info[i].gpu_utilization);
  HOLOSCAN_LOG_INFO("GPU {} memory_utilization: {}", i, gpu_info[i].memory_utilization);
  HOLOSCAN_LOG_INFO("GPU {} memory_total: {}", i, gpu_info[i].memory_total);
  HOLOSCAN_LOG_INFO("GPU {} memory_free: {}", i, gpu_info[i].memory_free);
  HOLOSCAN_LOG_INFO("GPU {} memory_used: {}", i, gpu_info[i].memory_used);
  HOLOSCAN_LOG_INFO("GPU {} memory_usage: {}", i, gpu_info[i].memory_usage);
  HOLOSCAN_LOG_INFO("GPU {} power_limit: {}", i, gpu_info[i].power_limit);
  HOLOSCAN_LOG_INFO("GPU {} power_usage: {}", i, gpu_info[i].power_usage);
  HOLOSCAN_LOG_INFO("GPU {} temperature: {}", i, gpu_info[i].temperature);
}
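The example above prints pci.pciDeviceId as two hexadecimal halves. In NVML this field is a combined 32-bit value holding a 16-bit vendor ID in the low half and a 16-bit device ID in the high half, which is why the code masks with 0xffff and shifts by 16. A small standalone sketch of that split follows; PciId, split_pci_id, and format_pci_id are hypothetical helper names and are not part of Holoscan or NVML.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

namespace sketch {

// Hypothetical holder for the two halves of a combined PCI ID.
struct PciId {
  std::uint16_t vendor;  // low 16 bits
  std::uint16_t device;  // high 16 bits
};

// Split the combined 32-bit value the same way the example does.
constexpr PciId split_pci_id(std::uint32_t combined) {
  return PciId{static_cast<std::uint16_t>(combined & 0xffffu),
               static_cast<std::uint16_t>(combined >> 16)};
}

// Format as zero-padded "vendor:device" in lowercase hexadecimal.
inline std::string format_pci_id(std::uint32_t combined) {
  const PciId id = split_pci_id(combined);
  char buf[10];  // 4 + 1 + 4 characters plus the terminating NUL
  std::snprintf(buf, sizeof(buf), "%04x:%04x",
                static_cast<unsigned>(id.vendor),
                static_cast<unsigned>(id.device));
  return std::string(buf);
}

}  // namespace sketch
```

For an illustrative combined value of 0x1DB110DE, the low half is 0x10de (NVIDIA's PCI vendor ID) and the high half is 0x1db1. The same split applies to pci.pciSubSystemId in the example.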

Constructors

GPUResourceMonitor

holoscan::GPUResourceMonitor::GPUResourceMonitor(
uint64_t metric_flags = kDefaultGpuMetrics
)

Construct a new GPUResourceMonitor object.

This constructor creates a new GPUResourceMonitor object.

Parameters

metric_flags
uint64_t (defaults to kDefaultGpuMetrics)

The metric flags (default: GPU_DEVICE_ID)

Destructor

~GPUResourceMonitor

virtual holoscan::GPUResourceMonitor::~GPUResourceMonitor()

Methods

init

void holoscan::GPUResourceMonitor::init()

Initialize the GPU resource monitor.

close

void holoscan::GPUResourceMonitor::close()

Close handle of the GPU resource monitor.

This function closes the handle of the opened NVML and CUDA Runtime libraries if they are open.

metric_flags

void holoscan::GPUResourceMonitor::metric_flags(
uint64_t metric_flags
)

Set metric flags.

This function sets the metric flags.

Parameters

metric_flags
uint64_t

The metric flags

update

GPUInfo holoscan::GPUResourceMonitor::update(
uint32_t index,
uint64_t metric_flags = GPUMetricFlag::DEFAULT
)

Update the GPU information and cache it.

This function updates information for the GPU with the given index based on the given metric flags and returns the GPU information. If the metric flags are not provided, the existing metric flags are used. It also caches the GPU information.

Returns: The GPU information.

Parameters

index
uint32_t

The GPU index.

metric_flags
uint64_t (defaults to GPUMetricFlag::DEFAULT)

The metric flags.

gpu_info

GPUInfo holoscan::GPUResourceMonitor::gpu_info(
uint32_t index,
uint64_t metric_flags = GPUMetricFlag::DEFAULT
)

Get the GPU information.

This method returns the GPU information based on the given index.

If the metric flags are provided, it returns the GPU information for that index based on the given metric flags. If the metric flags are not provided, it returns the cached GPU information.

Returns: The GPU information.

Parameters

index
uint32_t

The GPU index.

metric_flags
uint64_t (defaults to GPUMetricFlag::DEFAULT)

The metric flags.
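The relationship between update() and gpu_info() described above (refresh and cache versus read from the cache) can be illustrated with a small mock. This is a sketch under assumptions, not Holoscan code: CachingMonitor and Info are hypothetical types, the refresh_count field stands in for an actual NVML or CUDA Runtime query, and two overloads are used here to model "flags provided" versus "flags not provided", whereas the documented method uses a default argument.

```cpp
#include <cstdint>
#include <vector>

namespace sketch {

// Hypothetical stand-in for GPUInfo, reduced to what the sketch needs.
struct Info {
  std::uint32_t index = 0;
  std::uint64_t flags = 0;  // flags used for the most recent refresh
  int refresh_count = 0;    // how many times this entry was re-queried
};

class CachingMonitor {
 public:
  explicit CachingMonitor(std::uint32_t num_gpus) : cache_(num_gpus) {
    for (std::uint32_t i = 0; i < num_gpus; ++i) { cache_[i].index = i; }
  }

  // Re-query one GPU, store the result in the cache, and return it.
  Info update(std::uint32_t index, std::uint64_t flags) {
    Info& slot = cache_.at(index);
    slot.flags = flags;
    ++slot.refresh_count;  // placeholder for the real hardware query
    return slot;
  }

  // Flags not provided: return the cached entry without re-querying.
  Info gpu_info(std::uint32_t index) const { return cache_.at(index); }

  // Flags provided: refresh first, then return the updated entry.
  Info gpu_info(std::uint32_t index, std::uint64_t flags) {
    return update(index, flags);
  }

 private:
  std::vector<Info> cache_;
};

}  // namespace sketch
```

In this model, polling with gpu_info(index) is cheap because it never touches the backend, while passing flags forces a fresh query for that one GPU.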

num_gpus

uint32_t holoscan::GPUResourceMonitor::num_gpus() const

Get the number of GPUs.

Returns: The number of GPUs.

is_integrated_gpu

bool holoscan::GPUResourceMonitor::is_integrated_gpu(
uint32_t index
)

Check whether the GPU is integrated (iGPU).

Returns: True if the GPU is integrated (iGPU), false otherwise.

bind_nvml_methods

bool holoscan::GPUResourceMonitor::bind_nvml_methods()

bind_cuda_runtime_methods

bool holoscan::GPUResourceMonitor::bind_cuda_runtime_methods()

init_nvml

bool holoscan::GPUResourceMonitor::init_nvml()

init_cuda_runtime

bool holoscan::GPUResourceMonitor::init_cuda_runtime()

shutdown_nvml

void holoscan::GPUResourceMonitor::shutdown_nvml() noexcept

shutdown_cuda_runtime

void holoscan::GPUResourceMonitor::shutdown_cuda_runtime() noexcept

Member variables

| Name | Type | Description |
| --- | --- | --- |
| handle_ | void * | The handle of the GPU resource monitor. |
| cuda_handle_ | void * | The handle of the CUDA Runtime library. |
| nvmlErrorString | nvml::nvmlErrorString_t | The function pointer to the nvmlErrorString function. |
| nvmlInit | nvml::nvmlInit_t | The function pointer to the nvmlInit function. |
| nvmlDeviceGetCount | nvml::nvmlDeviceGetCount_t | The function pointer to the nvmlDeviceGetCount function. |
| nvmlDeviceGetHandleByIndex | nvml::nvmlDeviceGetHandleByIndex_t | The function pointer to the nvmlDeviceGetHandleByIndex function. |
| nvmlDeviceGetHandleByPciBusId | nvml::nvmlDeviceGetHandleByPciBusId_t | The function pointer to the nvmlDeviceGetHandleByPciBusId function. |
| nvmlDeviceGetHandleBySerial | nvml::nvmlDeviceGetHandleBySerial_t | The function pointer to the nvmlDeviceGetHandleBySerial function. |
| nvmlDeviceGetHandleByUUID | nvml::nvmlDeviceGetHandleByUUID_t | The function pointer to the nvmlDeviceGetHandleByUUID function. |
| nvmlDeviceGetName | nvml::nvmlDeviceGetName_t | The function pointer to the nvmlDeviceGetName function. |
| nvmlDeviceGetIndex | nvml::nvmlDeviceGetIndex_t | The function pointer to the nvmlDeviceGetIndex function. |
| nvmlDeviceGetPciInfo | nvml::nvmlDeviceGetPciInfo_t | The function pointer to the nvmlDeviceGetPciInfo function. |
| nvmlDeviceGetSerial | nvml::nvmlDeviceGetSerial_t | The function pointer to the nvmlDeviceGetSerial function. |
| nvmlDeviceGetUUID | nvml::nvmlDeviceGetUUID_t | The function pointer to the nvmlDeviceGetUUID function. |
| nvmlDeviceGetMemoryInfo | nvml::nvmlDeviceGetMemoryInfo_t | The function pointer to the nvmlDeviceGetMemoryInfo function. |
| nvmlDeviceGetUtilizationRates | nvml::nvmlDeviceGetUtilizationRates_t | The function pointer to the nvmlDeviceGetUtilizationRates function. |
| nvmlDeviceGetPowerManagementLimit | nvml::nvmlDeviceGetPowerManagementLimit_t | The function pointer to the nvmlDeviceGetPowerManagementLimit function. |
| nvmlDeviceGetPowerUsage | nvml::nvmlDeviceGetPowerUsage_t | The function pointer to the nvmlDeviceGetPowerUsage function. |
| nvmlDeviceGetTemperature | nvml::nvmlDeviceGetTemperature_t | The function pointer to the nvmlDeviceGetTemperature function. |
| nvmlShutdown | nvml::nvmlShutdown_t | The function pointer to the nvmlShutdown function. |
| cudaGetErrorString | cuda::cudaGetErrorString_t | The function pointer to the cudaGetErrorString function. |
| cudaGetDeviceCount | cuda::cudaGetDeviceCount_t | The function pointer to the cudaGetDeviceCount function. |
| cudaGetDeviceProperties | cuda::cudaGetDeviceProperties_t | The function pointer to the cudaGetDeviceProperties function. |
| cudaDeviceGetPCIBusId | cuda::cudaDeviceGetPCIBusId_t | The function pointer to the cudaDeviceGetPCIBusId function. |
| cudaMemGetInfo | cuda::cudaMemGetInfo_t | The function pointer to the cudaMemGetInfo function. |
| metric_flags_ | uint64_t | The metric flags. |
| is_cached_ | bool | The flag to indicate whether the GPU information is cached. |
| gpu_count_ | uint32_t | The cached number of GPUs. |
| gpu_info_ | std::vector< GPUInfo > | The cached GPU information. |
| nvml_devices_ | std::vector< nvml::nvmlDevice_t > | The cached NVML devices. |