metrics
API Reference: v1/metrics.proto
The Metrics API contains metrics & telemetry data queries & messages.
Table of Contents
- Services
- Messages
Services
MetricsManagementService
MetricsManagementService is responsible for retrieving current policy state, power usage, and other telemetry data from the system.
Currently, all metrics are retrieved via Redfish. Please see the supplemental Metrics API documentation for additional information regarding challenges & limitations of this service.
For dpsctl usage of the MetricsManagementService, see the dpsctl metrics CLI guide.
GPUMetricsQuery
rpc GPUMetricsQuery(GPUMetricsQueryRequest) GPUMetricsQueryResponse
GPUMetricsQuery is used to query GPU power usage metrics from a list of node-GPU pairs.
MetricsRawQuery
rpc MetricsRawQuery(MetricsRawRequest) MetricsRawResponse
MetricsRawQuery is used to query current policy state, power usage, and other telemetry data from a single node. Metrics are returned in their raw state and may differ by device.
MetricsQuery
rpc MetricsQuery(MetricsQueryRequest) MetricsQueryResponse
MetricsQuery is used to retrieve standardized GPU, CPU, and memory usage metrics from a list of requested nodes.
Messages
GPUMetricsQueryRequest
GPUMetricsQueryRequest is used to query GPU power usage metrics from a given list of node-GPU pairs.
Each pair must specify the node name & GPU ID.
If you are unsure of the GPU IDs available on the node,
you may retrieve the number of available GPUs using the MetricsRawQuery API.
The number of available GPUs is listed under the AvblNoGPU property; IDs start
at zero and range up to AvblNoGPU - 1.
Example:
{
"gpus": [
{
"node": "viking592",
"gpu": 0
},
{
"node": "viking592",
"gpu": 1
},
{
"node": "viking592",
"gpu": 2
},
{
"node": "viking593",
"gpu": 0
}
]
}

| Field | Type | Description |
|---|---|---|
| gpus | repeated GPUMetricsQueryRequest.GpuMetricsRequest | List of node-GPU pairs to query |
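
The ID range described above (zero through AvblNoGPU - 1) can be expanded into a full request payload. The following is a minimal sketch, not part of the API: it assumes each node's MetricsRawResponse has already been decoded into a plain dictionary of strings, as in the MetricsRawResponse example, and `build_gpu_metrics_request` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def build_gpu_metrics_request(
    raw_metrics_by_node: Dict[str, Dict[str, Any]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Build a GPUMetricsQueryRequest-shaped dict covering every GPU on each node.

    raw_metrics_by_node maps a node name to its decoded MetricsRawResponse map
    (an assumption of this sketch; the helper itself is hypothetical).
    """
    pairs: List[Dict[str, Any]] = []
    for node, raw in raw_metrics_by_node.items():
        # AvblNoGPU arrives as a string (e.g. "8"); going through float()
        # also tolerates a value formatted like "8.00".
        gpu_count = int(float(raw.get("AvblNoGPU", "0")))
        # IDs start at zero and run up to AvblNoGPU - 1.
        for gpu_id in range(gpu_count):
            pairs.append({"node": node, "gpu": gpu_id})
    return {"gpus": pairs}


request = build_gpu_metrics_request({"viking592": {"AvblNoGPU": "8"}})
print(len(request["gpus"]))
```

A node whose raw metrics lack AvblNoGPU contributes no pairs, so the caller can decide separately how to handle nodes that report no GPU count.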
GPUMetricsQueryRequest.GpuMetricsRequest
| Field | Type | Description |
|---|---|---|
| node | string | Name of the node entity within the configured Topology |
| gpu | uint32 | The GPU ID on the node. IDs typically range from 0 to 7 |
GPUMetricsQueryResponse
{
"metrics": [
{
"node": "viking592",
"gpu": 0,
"usage": 700.0
},
{
"node": "viking592",
"gpu": 1,
"usage": 698.2
},
{
"node": "viking592",
"gpu": 2,
"usage": 700.0
},
{
"node": "viking593",
"gpu": 0,
"usage": 374.5
}
]
}

| Field | Type | Description |
|---|---|---|
| metrics | repeated GPUMetricsQueryResponse.GpuMetrics | A list of node-GPU pairs with the corresponding GPU power usage |
GPUMetricsQueryResponse.GpuMetrics
| Field | Type | Description |
|---|---|---|
| node | string | Name of the node entity |
| gpu | uint32 | GPU ID |
| usage | double | Power usage (in watts) |
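
Because the response is a flat list of node-GPU entries, per-node totals have to be computed by the caller. The sketch below shows one way to do that, assuming the response has been decoded into a dictionary shaped like the example above; `total_gpu_power_by_node` is a hypothetical helper, not something the service provides.

```python
from collections import defaultdict
from typing import Any, Dict


def total_gpu_power_by_node(response: Dict[str, Any]) -> Dict[str, float]:
    """Sum the reported GPU power usage (in watts) for each node."""
    totals: Dict[str, float] = defaultdict(float)
    for entry in response.get("metrics", []):
        totals[entry["node"]] += float(entry["usage"])
    return dict(totals)


# Values taken from the GPUMetricsQueryResponse example.
example_response = {
    "metrics": [
        {"node": "viking592", "gpu": 0, "usage": 700.0},
        {"node": "viking592", "gpu": 1, "usage": 698.2},
        {"node": "viking592", "gpu": 2, "usage": 700.0},
        {"node": "viking593", "gpu": 0, "usage": 374.5},
    ]
}

for node, watts in total_gpu_power_by_node(example_response).items():
    print(f"{node}: {watts:.1f} W")
```

The totals cover only the GPUs that were requested, so they represent a node's full GPU power draw only when every GPU on that node was included in the request.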
MetricsQueryRequest
MetricsQueryRequest is used to query standardized usage metrics from the requested list of nodes.
Example:
{
"nodes": [
"node001",
"node002",
"node003"
]
}

| Field | Type | Description |
|---|---|---|
| nodes | repeated string | nodes is the list of nodes to retrieve metrics from. |
MetricsQueryResponse
MetricsQueryResponse contains standardized usage metrics for GPU, CPU, and memory from the requested nodes.
| Field | Type | Description |
|---|---|---|
| nodes | repeated NodeMetricsResponse | List of node metrics responses, one for each requested node |
MetricsRawRequest
MetricsRawRequest is used to query node power usage & other telemetry metrics from the requested node.
Example:
{
"node": "viking592"
}

| Field | Type | Description |
|---|---|---|
| node | string | Name of the node entity |
MetricsRawResponse
MetricsRawResponse returns a map of telemetry metrics values for the requested node.
View the metrics definitions here.
Example:
{
"AvblNoCPU": "2",
"AvblNoGPU": "8",
"CPU_ENERGY_UNIT": "14",
"CPU_PWR_UNIT": "3",
"CPU_TIM_UNIT": "10",
"DIMM_Count_Socket_0": "16.00",
"DIMM_Count_Socket_1": "16.00",
"DIMM_Count_Total": "32",
"Domain_Policy_Active": "1",
"FixPwrAverage": "1099.00",
"FixPwrDGXAvg": "871.00",
"FixPwrHGXAvg": "228.00",
"GPU_PWR_BRAKE": "0",
"GPU_PWR_PRSNT": "1",
"PSU_Active_Policy": "4",
"PSU_Redundancy_Policy": "0",
"PSU_WORKING_CNT": "6",
"coreEfficiency_0": "99906.00",
"coreEfficiency_1": "214150.00",
"cpuEnergy_0": "195.00",
"cpuEnergy_1": "195.00",
"cpuPackagePowerCapabilitiesMax_0": "350",
"cpuPackagePowerCapabilitiesMax_1": "350",
"cpuPackagePowerCapabilitiesMin_0": "209",
"cpuPackagePowerCapabilitiesMin_1": "209",
"cpuPackagePowerLimit1_0": "400.00",
"cpuPackagePowerLimit1_1": "400.00",
"cpuPackagePowerLimit2_0": "400.00",
"cpuPackagePowerLimit2_1": "400.00",
"cpuPackagePower_avg_0": "195.00",
"cpuPackagePower_avg_1": "195.00",
"dcPlatformEnergy": "2069.00",
"dcPlatformPowerDGX_avg": "1321.00",
"dcPlatformPowerHGX_avg": "753.00",
"dcPlatformPowerLimit1": "0.00",
"dcPlatformPowerLimit2": "0.00",
"dcPlatformPower_avg": "2074.00",
"dramEnergy_0": "28.00",
"dramEnergy_1": "37.00",
"dramPackagePowerCapabilitiesMax_0": "122.00",
"dramPackagePowerCapabilitiesMax_1": "122.00",
"dramPackagePowerCapabilitiesMin_0": "0.00",
"dramPackagePowerCapabilitiesMin_1": "0.00",
"dramPowerLimit_0": "300.00",
"dramPowerLimit_1": "300.00",
"dramPower_avg_0": "28.00",
"dramPower_avg_1": "37.00",
"gpuPowerCapabilitiesMax_0": "700.00",
"gpuPowerCapabilitiesMax_1": "700.00",
"gpuPowerCapabilitiesMax_2": "700.00",
"gpuPowerCapabilitiesMax_3": "700.00",
"gpuPowerCapabilitiesMax_4": "700.00",
"gpuPowerCapabilitiesMax_5": "700.00",
"gpuPowerCapabilitiesMax_6": "700.00",
"gpuPowerCapabilitiesMax_7": "700.00",
"gpuPowerCapabilitiesMin_0": "200.00",
"gpuPowerCapabilitiesMin_1": "200.00",
"gpuPowerCapabilitiesMin_2": "200.00",
"gpuPowerCapabilitiesMin_3": "200.00",
"gpuPowerCapabilitiesMin_4": "200.00",
"gpuPowerCapabilitiesMin_5": "200.00",
"gpuPowerCapabilitiesMin_6": "200.00",
"gpuPowerCapabilitiesMin_7": "200.00",
"gpuPowerLimit_0": "700.00",
"gpuPowerLimit_1": "700.00",
"gpuPowerLimit_2": "700.00",
"gpuPowerLimit_3": "700.00",
"gpuPowerLimit_4": "700.00",
"gpuPowerLimit_5": "700.00",
"gpuPowerLimit_6": "700.00",
"gpuPowerLimit_7": "700.00",
"gpuPower_avg_0": "64.00",
"gpuPower_avg_1": "66.00",
"gpuPower_avg_2": "65.00",
"gpuPower_avg_3": "65.00",
"gpuPower_avg_4": "64.00",
"gpuPower_avg_5": "64.00",
"gpuPower_avg_6": "70.00",
"gpuPower_avg_7": "65.00",
"prochotRatioCapabilitiesMax_0": "2000",
"prochotRatioCapabilitiesMax_1": "2000",
"prochotRatioCapabilitiesMin_0": "500",
"prochotRatioCapabilitiesMin_1": "500",
"turboRatioCapabilitiesMax_0": "3800",
"turboRatioCapabilitiesMax_1": "3800",
"turboRatioCapabilitiesMin_0": "500",
"turboRatioCapabilitiesMin_1": "500"
}

| Field | Type | Description |
|---|---|---|
| metrics | google.protobuf.Any | A map of telemetry & usage data for the node |
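
In the example above, every value is a string, and per-device metrics carry a numeric suffix such as `_0` or `_7`. The following sketch groups those suffixed keys by metric name and converts the values to floats. It relies on the naming pattern seen in the example, which is an assumption: raw metrics may differ by device, and `group_indexed_metrics` is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches keys such as "gpuPower_avg_3": a metric name, an underscore, and a
# trailing device index. This pattern is inferred from the example only.
_INDEXED_KEY = re.compile(r"^(?P<name>.+)_(?P<index>\d+)$")


def group_indexed_metrics(raw: Dict[str, str]) -> Dict[str, Dict[int, float]]:
    """Group suffixed raw metrics as {metric name: {device index: value}}."""
    grouped: Dict[str, Dict[int, float]] = defaultdict(dict)
    for key, value in raw.items():
        match = _INDEXED_KEY.match(key)
        if match is None:
            continue  # node-level metric such as "AvblNoGPU"
        try:
            number = float(value)
        except (TypeError, ValueError):
            continue  # skip values that are not numeric
        grouped[match.group("name")][int(match.group("index"))] = number
    return dict(grouped)


sample = {
    "AvblNoGPU": "8",
    "gpuPower_avg_0": "64.00",
    "gpuPower_avg_1": "66.00",
    "dcPlatformPowerLimit1": "0.00",
}
print(group_indexed_metrics(sample))
```

Keys without an underscore-separated numeric suffix, such as `dcPlatformPowerLimit1`, are treated as node-level metrics and left out of the grouping.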
NodeMetricsResponse
NodeMetricsResponse contains aggregated metrics for all components on a single node.
| Field | Type | Description |
|---|---|---|
| name | string | Name of the node |
| num_gpus | uint32 | Number of GPUs present on the node |
| num_cpus | uint32 | Number of CPUs present on the node |
| num_memory_units | uint32 | Number of memory units (DIMMs) present on the node |
| power_usage | double | none |
| gpus | repeated NodeMetricsResponse.ComponentUsage | List of GPU usage metrics for each GPU on the node |
| cpus | repeated NodeMetricsResponse.ComponentUsage | List of CPU usage metrics for each CPU on the node |
| memory | repeated NodeMetricsResponse.ComponentUsage | List of memory usage metrics for each memory unit on the node |
NodeMetricsResponse.ComponentUsage
ComponentUsage contains power and thermal metrics for a single Component. Where not available, metrics will be zero.
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier for the component on the node (typically 0-based) |
| power_usage | double | Current power consumption in watts |
| energy_usage | double | Cumulative energy consumption since last reset |
| energy_usage_unit | string | Unit for energy consumption (e.g., “Joules”) |
| set_limit | double | Currently set power limit in watts |
| max_limit | double | Maximum allowed power limit in watts |
| min_limit | double | Minimum allowed power limit in watts |
| temperature_celsius | double | Current temperature in Celsius |
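
Since unavailable metrics are reported as zero, code that derives values from ComponentUsage needs to distinguish "not reported" from a real reading. The sketch below computes the remaining power headroom for one component and treats a zero in either input as unknown. It assumes the message has been decoded into a dictionary keyed by the field names above; `power_headroom` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def power_headroom(component: Dict[str, Any]) -> Optional[float]:
    """Return set_limit - power_usage in watts, or None if either is unreported."""
    set_limit = component.get("set_limit", 0.0)
    power_usage = component.get("power_usage", 0.0)
    # Zero is documented as the "not available" value, so it is not treated
    # as a real measurement here (a design choice of this sketch).
    if set_limit == 0 or power_usage == 0:
        return None
    return set_limit - power_usage


gpu = {"id": "0", "power_usage": 64.0, "set_limit": 700.0}
print(power_headroom(gpu))
```

Returning `None` instead of a number keeps a missing limit from being mistaken for a component that is far over or under its cap.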