Class GPUResourceMonitor
Defined in File gpu_resource_monitor.hpp
-
class GPUResourceMonitor
GPUResourceMonitor class.
This class is responsible for monitoring the GPU resources. It provides information about the GPU resources (through holoscan::GPUInfo) to the SystemResourceManager class.
The following holoscan::GPUMetricFlag flags are supported:
DEFAULT: Default GPU metrics (GPU_DEVICE_ID)
GPU_DEVICE_ID: GPU device ID (name, pci, serial, uuid)
GPU_UTILIZATION: GPU utilization (gpu_utilization, memory_utilization)
MEMORY_USAGE: GPU memory usage (memory_total, memory_free, memory_used, memory_usage)
POWER_LIMIT: GPU power limit (power_limit)
POWER_USAGE: GPU power usage (power_usage)
TEMPERATURE: GPU temperature (temperature)
ALL: All GPU metrics above
The index information is always available.
This class uses the NVML library to get the GPU information. If the NVML library is not available (as in the case of iGPU), it uses the CUDA Runtime API to get the GPU information.
The following information is not available when using the CUDA Runtime API:
GPU_DEVICE_ID: pci.pciDeviceId and pci.pciSubSystemId are not available
GPU_UTILIZATION: gpu_utilization and memory_utilization are not available
POWER_LIMIT: power_limit is not available
POWER_USAGE: power_usage is not available
TEMPERATURE: temperature is not available
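The fallback rules above can be read as a mask over the requested flags. The following is a minimal sketch of that idea, not the class's implementation: the `sketch` namespace, the bit values, `kNvmlOnly`, and `usable_without_nvml` are all hypothetical, and only the flag names come from this page. GPU_DEVICE_ID is kept in the result because it is only partially affected (pci.pciDeviceId and pci.pciSubSystemId are missing), and MEMORY_USAGE is kept on the assumption that it is supported, since it is not listed as unavailable.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Hypothetical stand-in for holoscan::GPUMetricFlag.
// The bit values are assumptions made for this sketch only.
enum MetricFlag : uint64_t {
  GPU_DEVICE_ID = 1ULL << 0,
  GPU_UTILIZATION = 1ULL << 1,
  MEMORY_USAGE = 1ULL << 2,
  POWER_LIMIT = 1ULL << 3,
  POWER_USAGE = 1ULL << 4,
  TEMPERATURE = 1ULL << 5,
};

// Metric groups that the list above marks as unavailable without NVML.
constexpr uint64_t kNvmlOnly =
    GPU_UTILIZATION | POWER_LIMIT | POWER_USAGE | TEMPERATURE;

// Returns the subset of `requested` that can still carry data when only
// the CUDA Runtime API is usable.
constexpr uint64_t usable_without_nvml(uint64_t requested) {
  return requested & ~kNvmlOnly;
}

// Small usage check: temperature is dropped, memory usage survives.
inline void usage_example() {
  const uint64_t requested = MEMORY_USAGE | TEMPERATURE;
  assert(usable_without_nvml(requested) == MEMORY_USAGE);
}

}  // namespace sketch
```

A caller could use such a mask to decide which GPUInfo fields are worth logging on an iGPU system.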
Example:
#include <holoscan/core/system/system_resource_manager.hpp>
#include <holoscan/logger/logger.hpp>
...
holoscan::GPUResourceMonitor gpu_resource_monitor;
gpu_resource_monitor.update(holoscan::GPUMetricFlag::ALL);
auto gpu_info = gpu_resource_monitor.gpu_info();
auto gpu_count = gpu_resource_monitor.num_gpus();
for (int i = 0; i < gpu_count; i++) {
  // Print GPU information (GPUInfo)
  HOLOSCAN_LOG_INFO("GPU {} is available", gpu_info[i].index);
  HOLOSCAN_LOG_INFO("GPU {} name: {}", i, gpu_info[i].name);
  HOLOSCAN_LOG_INFO("GPU {} is iGPU: {}", i, gpu_info[i].is_integrated);
  HOLOSCAN_LOG_INFO("GPU {} pci.busId: {}", i, gpu_info[i].pci.busId);
  HOLOSCAN_LOG_INFO("GPU {} pci.busIdLegacy: {}", i, gpu_info[i].pci.busIdLegacy);
  HOLOSCAN_LOG_INFO("GPU {} pci.domain: {}", i, gpu_info[i].pci.domain);
  HOLOSCAN_LOG_INFO("GPU {} pci.bus: {}", i, gpu_info[i].pci.bus);
  HOLOSCAN_LOG_INFO("GPU {} pci.device: {}", i, gpu_info[i].pci.device);
  HOLOSCAN_LOG_INFO("GPU {} pci.pciDeviceId: {:x}:{:x}",
                    i,
                    gpu_info[i].pci.pciDeviceId & 0xffff,
                    gpu_info[i].pci.pciDeviceId >> 16);
  HOLOSCAN_LOG_INFO("GPU {} pci.pciSubSystemId: {:x}:{:x}",
                    i,
                    gpu_info[i].pci.pciSubSystemId & 0xffff,
                    gpu_info[i].pci.pciSubSystemId >> 16);
  HOLOSCAN_LOG_INFO("GPU {} serial: {}", i, gpu_info[i].serial);
  HOLOSCAN_LOG_INFO("GPU {} uuid: {}", i, gpu_info[i].uuid);
  HOLOSCAN_LOG_INFO("GPU {} gpu_utilization: {}", i, gpu_info[i].gpu_utilization);
  HOLOSCAN_LOG_INFO("GPU {} memory_utilization: {}", i, gpu_info[i].memory_utilization);
  HOLOSCAN_LOG_INFO("GPU {} memory_total: {}", i, gpu_info[i].memory_total);
  HOLOSCAN_LOG_INFO("GPU {} memory_free: {}", i, gpu_info[i].memory_free);
  HOLOSCAN_LOG_INFO("GPU {} memory_used: {}", i, gpu_info[i].memory_used);
  HOLOSCAN_LOG_INFO("GPU {} memory_usage: {}", i, gpu_info[i].memory_usage);
  HOLOSCAN_LOG_INFO("GPU {} power_limit: {}", i, gpu_info[i].power_limit);
  HOLOSCAN_LOG_INFO("GPU {} power_usage: {}", i, gpu_info[i].power_usage);
  HOLOSCAN_LOG_INFO("GPU {} temperature: {}", i, gpu_info[i].temperature);
}
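The two `{:x}:{:x}` arguments in the example split each combined 32-bit PCI identifier into its low and high 16-bit halves. A minimal sketch of that split as a reusable helper follows. `PciId` and `split_pci_id` are hypothetical names, not part of Holoscan, and labeling the low half as the vendor ID assumes the NVML convention for `pciDeviceId`. The sample value in the usage check is illustrative only.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Hypothetical helper type: the two 16-bit halves of a combined PCI ID.
struct PciId {
  uint16_t vendor;  // low 16 bits (assumed to be the vendor ID)
  uint16_t device;  // high 16 bits (assumed to be the device ID)
};

// Split a combined ID in the same order as the example's format arguments:
// low 16 bits first, high 16 bits second.
constexpr PciId split_pci_id(uint32_t combined) {
  return PciId{static_cast<uint16_t>(combined & 0xffffu),
               static_cast<uint16_t>(combined >> 16)};
}

// Small usage check with an illustrative combined value.
inline void usage_example() {
  const PciId id = split_pci_id(0x1EB810DEu);
  assert(id.vendor == 0x10DE);
  assert(id.device == 0x1EB8);
}

}  // namespace sketch
```

The same helper would apply to `pciSubSystemId`, which the example formats identically.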
Public Functions
-
explicit GPUResourceMonitor(uint64_t metric_flags = kDefaultGpuMetrics)
Construct a new GPUResourceMonitor object.
This constructor creates a new GPUResourceMonitor object.
- Parameters
metric_flags – The metric flags (default: GPU_DEVICE_ID)
-
virtual ~GPUResourceMonitor()
-
void init()
Initialize the GPU resource monitor.
-
void close()
Close handle of the GPU resource monitor.
This function closes the handle of the opened NVML and CUDA Runtime libraries if they are open.
-
uint64_t metric_flags() const
Get metric flags.
This function returns the metric flags.
- Returns
The metric flags.
-
void metric_flags(uint64_t metric_flags)
Set metric flags.
This function sets the metric flags.
- Parameters
metric_flags – The metric flags
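Because the metric flags are passed as a single uint64_t, several metric groups are presumably combined with bitwise OR. The following is a minimal sketch of composing and testing such a mask. The `sketch` namespace, the bit values, `has_flag`, and `memory_dashboard_flags` are hypothetical and chosen for illustration; only the flag names are taken from this page.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Hypothetical stand-in for holoscan::GPUMetricFlag.
// The bit values are assumptions made for this sketch only.
enum MetricFlag : uint64_t {
  GPU_DEVICE_ID = 1ULL << 0,
  GPU_UTILIZATION = 1ULL << 1,
  MEMORY_USAGE = 1ULL << 2,
  POWER_LIMIT = 1ULL << 3,
  POWER_USAGE = 1ULL << 4,
  TEMPERATURE = 1ULL << 5,
};

// Returns true when every bit of `flag` is set in `mask`.
constexpr bool has_flag(uint64_t mask, uint64_t flag) {
  return (mask & flag) == flag;
}

// Combine the groups that a memory-focused view might request.
constexpr uint64_t memory_dashboard_flags() {
  return GPU_DEVICE_ID | MEMORY_USAGE;
}

// Small usage check: the combined mask contains exactly what was requested.
inline void usage_example() {
  const uint64_t flags = memory_dashboard_flags();
  assert(has_flag(flags, MEMORY_USAGE));
  assert(!has_flag(flags, TEMPERATURE));
}

}  // namespace sketch
```

With the real class, a mask built this way from holoscan::GPUMetricFlag values would be the argument passed to metric_flags(uint64_t).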
-
GPUInfo update(uint32_t index, uint64_t metric_flags = GPUMetricFlag::DEFAULT)
Update the GPU information and cache it.
This function updates information for the GPU with the given index based on the given metric flags and returns the GPU information. If the metric flags are not provided, the existing metric flags are used. It also caches the GPU information.
- Parameters
index – The GPU index.
metric_flags – The metric flags.
- Returns
The GPU information.
-
std::vector<GPUInfo> update(uint64_t metric_flags = GPUMetricFlag::DEFAULT)
Update all GPU information and cache it.
This function updates the information for all GPUs based on the given metric flags and returns a vector of GPU information. If the metric flags are not provided, the existing metric flags are used. It also caches the GPU information.
- Parameters
metric_flags – The metric flags.
- Returns
The vector of GPU information.
-
GPUInfo &update(uint32_t index, GPUInfo &gpu_info, uint64_t metric_flags = GPUMetricFlag::DEFAULT)
Update the GPU information.
This function fills the GPU information given as the argument based on the given metric flags and returns the GPU information. If the metric flags are not provided, the existing metric flags are used.
- Parameters
index – The GPU index.
gpu_info – The GPU information.
metric_flags – The metric flags.
- Returns
The GPU information filled with the updated values (same as the argument).
-
GPUInfo gpu_info(uint32_t index, uint64_t metric_flags = GPUMetricFlag::DEFAULT)
Get the GPU information.
This method returns the GPU information based on the given index.
If the metric flags are provided, it returns the GPU information for that index based on the given metric flags. If the metric flags are not provided, it returns the cached GPU information.
- Parameters
index – The GPU index.
metric_flags – The metric flags.
- Returns
The GPU information.
-
std::vector<GPUInfo> gpu_info(uint64_t metric_flags = GPUMetricFlag::DEFAULT)
Get all GPU information.
This method returns the vector of GPU information. If the metric flags are provided, it returns the GPU information based on the given metric flags. If the metric flags are not provided, it returns the cached GPU information.
- Parameters
metric_flags – The metric flags.
- Returns
All GPU information.
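The update() and gpu_info() descriptions above suggest a simple polling pattern: refresh once, then read the cache as often as needed. The sketch below models only that contract with a hypothetical `CachingMonitor` class and a fake driver query. It is not the Holoscan implementation, and `Info`, `query_count`, and the generated values are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

namespace sketch {

// Hypothetical, reduced stand-in for holoscan::GPUInfo.
struct Info {
  uint32_t index = 0;
  uint64_t memory_used = 0;
};

// Hypothetical model of "update() queries and caches, gpu_info() without
// flags reads the cache". Not the real class.
class CachingMonitor {
 public:
  explicit CachingMonitor(uint32_t gpu_count) : cache_(gpu_count) {
    for (uint32_t i = 0; i < gpu_count; ++i) { cache_[i].index = i; }
  }

  // Pretend to query the driver: bump a counter and refresh every entry.
  std::vector<Info> update() {
    ++query_count_;
    for (auto& info : cache_) { info.memory_used = 100ULL * query_count_; }
    return cache_;
  }

  // Cached read: no (fake) driver query is issued.
  std::vector<Info> gpu_info() const { return cache_; }

  uint32_t query_count() const { return query_count_; }

 private:
  std::vector<Info> cache_;
  uint32_t query_count_ = 0;
};

// Small usage check: two cached reads after one update cost one query.
inline void usage_example() {
  CachingMonitor monitor(2);
  monitor.update();
  const auto first = monitor.gpu_info();
  const auto second = monitor.gpu_info();
  assert(monitor.query_count() == 1);
  assert(first[1].memory_used == second[1].memory_used);
}

}  // namespace sketch
```

In a monitoring loop this means calling update() at the sampling interval and letting every other consumer read the cached values.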
-
uint32_t num_gpus() const
Get the number of GPUs.
- Returns
The number of GPUs.
-
bool is_integrated_gpu(uint32_t index)
Check whether the GPU is integrated (iGPU).
- Parameters
index – The GPU index.
- Returns
True if the GPU is integrated (iGPU), false otherwise.
Protected Functions
-
bool bind_nvml_methods()
-
bool bind_cuda_runtime_methods()
-
bool init_nvml()
-
bool init_cuda_runtime()
-
void shutdown_nvml() noexcept
-
void shutdown_cuda_runtime() noexcept
Protected Attributes
-
void *handle_ = nullptr
The handle of the GPU resource monitor.
-
void *cuda_handle_ = nullptr
The handle of the CUDA Runtime library.
-
nvml::nvmlErrorString_t nvmlErrorString = nullptr
The function pointer to the nvmlErrorString function.
-
nvml::nvmlInit_t nvmlInit = nullptr
The function pointer to the nvmlInit function.
-
nvml::nvmlDeviceGetCount_t nvmlDeviceGetCount = nullptr
The function pointer to the nvmlDeviceGetCount function.
-
nvml::nvmlDeviceGetHandleByIndex_t nvmlDeviceGetHandleByIndex = nullptr
The function pointer to the nvmlDeviceGetHandleByIndex function.
-
nvml::nvmlDeviceGetHandleByPciBusId_t nvmlDeviceGetHandleByPciBusId = nullptr
The function pointer to the nvmlDeviceGetHandleByPciBusId function.
-
nvml::nvmlDeviceGetHandleBySerial_t nvmlDeviceGetHandleBySerial = nullptr
The function pointer to the nvmlDeviceGetHandleBySerial function.
-
nvml::nvmlDeviceGetHandleByUUID_t nvmlDeviceGetHandleByUUID = nullptr
The function pointer to the nvmlDeviceGetHandleByUUID function.
-
nvml::nvmlDeviceGetName_t nvmlDeviceGetName = nullptr
The function pointer to the nvmlDeviceGetName function.
-
nvml::nvmlDeviceGetIndex_t nvmlDeviceGetIndex = nullptr
The function pointer to the nvmlDeviceGetIndex function.
-
nvml::nvmlDeviceGetPciInfo_t nvmlDeviceGetPciInfo = nullptr
The function pointer to the nvmlDeviceGetPciInfo function.
-
nvml::nvmlDeviceGetSerial_t nvmlDeviceGetSerial = nullptr
The function pointer to the nvmlDeviceGetSerial function.
-
nvml::nvmlDeviceGetUUID_t nvmlDeviceGetUUID = nullptr
The function pointer to the nvmlDeviceGetUUID function.
-
nvml::nvmlDeviceGetMemoryInfo_t nvmlDeviceGetMemoryInfo = nullptr
The function pointer to the nvmlDeviceGetMemoryInfo function.
-
nvml::nvmlDeviceGetUtilizationRates_t nvmlDeviceGetUtilizationRates = nullptr
The function pointer to the nvmlDeviceGetUtilizationRates function.
-
nvml::nvmlDeviceGetPowerManagementLimit_t nvmlDeviceGetPowerManagementLimit = nullptr
The function pointer to the nvmlDeviceGetPowerManagementLimit function.
-
nvml::nvmlDeviceGetPowerUsage_t nvmlDeviceGetPowerUsage = nullptr
The function pointer to the nvmlDeviceGetPowerUsage function.
-
nvml::nvmlDeviceGetTemperature_t nvmlDeviceGetTemperature = nullptr
The function pointer to the nvmlDeviceGetTemperature function.
-
nvml::nvmlShutdown_t nvmlShutdown = nullptr
The function pointer to the nvmlShutdown function.
-
cuda::cudaGetErrorString_t cudaGetErrorString = nullptr
The function pointer to the cudaGetErrorString function.
-
cuda::cudaGetDeviceCount_t cudaGetDeviceCount = nullptr
The function pointer to the cudaGetDeviceCount function.
-
cuda::cudaGetDeviceProperties_t cudaGetDeviceProperties = nullptr
The function pointer to the cudaGetDeviceProperties function.
-
cuda::cudaDeviceGetPCIBusId_t cudaDeviceGetPCIBusId = nullptr
The function pointer to the cudaDeviceGetPCIBusId function.
-
cuda::cudaMemGetInfo_t cudaMemGetInfo = nullptr
The function pointer to the cudaMemGetInfo function.
-
uint64_t metric_flags_ = kDefaultGpuMetrics
The metric flags.
-
bool is_cached_ = false
The flag to indicate whether the GPU information is cached.
-
uint32_t gpu_count_ = 0
The cached number of GPUs.
-
std::vector<GPUInfo> gpu_info_
The cached GPU information.
-
std::vector<nvml::nvmlDevice_t> nvml_devices_
The cached NVML devices.