metrics

API Reference: v1/metrics.proto

The Metrics API defines the queries and messages used to retrieve metrics and telemetry data.

Services

MetricsManagementService

MetricsManagementService is responsible for retrieving current policy state, power usage, and other telemetry data from the system.

Currently, all metrics are retrieved via Redfish. Please see the supplemental Metrics API documentation for additional information regarding challenges & limitations of this service.

To use the MetricsManagementService from dpsctl, see the dpsctl metrics CLI guide.

GPUMetricsQuery

rpc GPUMetricsQuery(GPUMetricsQueryRequest) GPUMetricsQueryResponse

GPUMetricsQuery is used to query GPU power usage metrics from a list of node-GPU pairs.

MetricsRawQuery

rpc MetricsRawQuery(MetricsRawRequest) MetricsRawResponse

MetricsRawQuery is used to query current policy state, power usage, and other telemetry data from a single node. Metrics are returned in their raw state and may differ by device.

MetricsQuery

rpc MetricsQuery(MetricsQueryRequest) MetricsQueryResponse

MetricsQuery is used to retrieve standardized GPU, CPU, and memory usage metrics from a list of requested nodes.

Messages

GPUMetricsQueryRequest

GPUMetricsQueryRequest is used to query GPU power usage metrics from a given list of node-GPU pairs.

Each pair must specify the node name & GPU ID.

If you are unsure which GPU IDs are available on a node, you can retrieve the number of available GPUs using the MetricsRawQuery API. The count is listed under the AvblNoGPU property; IDs start at zero and run to AvblNoGPU - 1.

Example:

{
    "gpus": [
        {
            "node": "viking592",
            "gpu": 0
        },
        {
            "node": "viking592",
            "gpu": 1
        },
        {
            "node": "viking592",
            "gpu": 2
        },
        {
            "node": "viking593",
            "gpu": 0
        }
    ]
}

| Field | Type | Description |
| --- | --- | --- |
| gpus | repeated GPUMetricsQueryRequest.GpuMetricsRequest | List of node-GPU pairs to query |
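The GPU ID rule described above (IDs run from zero to AvblNoGPU - 1) can be sketched in Python. This is a minimal illustration, not part of the API: the function name `build_gpu_query` and its input shape (a dict mapping each node name to that node's raw metrics map) are assumptions, and no client call is made.

```python
import json


def build_gpu_query(raw_metrics_by_node):
    """Build a GPUMetricsQueryRequest-shaped dict covering every GPU on each node.

    raw_metrics_by_node is a hypothetical input: node name -> raw metrics map,
    with string values as in the MetricsRawResponse example.
    """
    gpus = []
    for node, raw in raw_metrics_by_node.items():
        # AvblNoGPU holds the number of GPUs available on the node.
        count = int(float(raw.get("AvblNoGPU", "0")))
        # GPU IDs start at zero and run to AvblNoGPU - 1.
        gpus.extend({"node": node, "gpu": gpu_id} for gpu_id in range(count))
    return {"gpus": gpus}


request = build_gpu_query({
    "viking592": {"AvblNoGPU": "3"},
    "viking593": {"AvblNoGPU": "1"},
})
print(json.dumps(request, indent=4))
```

With these inputs the printed payload lists GPUs 0 through 2 on viking592 and GPU 0 on viking593, matching the shape of the example request.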

GPUMetricsQueryRequest.GpuMetricsRequest

| Field | Type | Description |
| --- | --- | --- |
| node | string | Name of the node entity within the configured Topology |
| gpu | uint32 | The GPU ID on the node. IDs typically range from 0 to 7 |

GPUMetricsQueryResponse

Example:

{
  "metrics": [
    {
      "node": "viking592",
      "gpu": 0,
      "usage": 700.0
    },
    {
      "node": "viking592",
      "gpu": 1,
      "usage": 698.2
    },
    {
      "node": "viking592",
      "gpu": 2,
      "usage": 700.0
    },
    {
      "node": "viking593",
      "gpu": 0,
      "usage": 374.5
    }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| metrics | repeated GPUMetricsQueryResponse.GpuMetrics | A list of node-GPU pairs with the corresponding GPU power usage |

GPUMetricsQueryResponse.GpuMetrics

| Field | Type | Description |
| --- | --- | --- |
| node | string | Name of the node entity |
| gpu | uint32 | GPU ID |
| usage | double | Power usage (in watts) |
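As an illustration of consuming these fields, the following Python sketch totals GPU power per node from a response that has already been decoded into a dict. The helper name `power_by_node` is hypothetical, and the input simply mirrors the JSON example for GPUMetricsQueryResponse.

```python
from collections import defaultdict


def power_by_node(response):
    """Sum the per-GPU 'usage' values (watts) for each node in the response."""
    totals = defaultdict(float)
    for entry in response.get("metrics", []):
        totals[entry["node"]] += entry["usage"]
    return dict(totals)


response = {
    "metrics": [
        {"node": "viking592", "gpu": 0, "usage": 700.0},
        {"node": "viking592", "gpu": 1, "usage": 698.2},
        {"node": "viking592", "gpu": 2, "usage": 700.0},
        {"node": "viking593", "gpu": 0, "usage": 374.5},
    ]
}

for node, watts in sorted(power_by_node(response).items()):
    print(f"{node}: {watts:.1f} W")
```

Only the GPUs named in the request appear in the response, so each total covers the queried GPUs rather than the whole node.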

MetricsQueryRequest

MetricsQueryRequest is used to query standardized usage metrics from the requested list of nodes.

Example:

{
    "nodes": [
        "node001",
        "node002",
        "node003"
    ]
}

| Field | Type | Description |
| --- | --- | --- |
| nodes | repeated string | List of nodes to retrieve metrics from |

MetricsQueryResponse

MetricsQueryResponse contains standardized usage metrics for GPU, CPU, and memory from the requested nodes.

| Field | Type | Description |
| --- | --- | --- |
| nodes | repeated NodeMetricsResponse | List of node metrics responses, one for each requested node |

MetricsRawRequest

MetricsRawRequest is used to query node power usage & other telemetry metrics from the requested node.

Example:

{
   "node": "viking592"
}

| Field | Type | Description |
| --- | --- | --- |
| node | string | Name of the node entity |

MetricsRawResponse

MetricsRawResponse returns a map of telemetry metrics values for the requested node.

View the metrics definitions here.

Example:

{
  "AvblNoCPU": "2",
  "AvblNoGPU": "8",
  "CPU_ENERGY_UNIT": "14",
  "CPU_PWR_UNIT": "3",
  "CPU_TIM_UNIT": "10",
  "DIMM_Count_Socket_0": "16.00",
  "DIMM_Count_Socket_1": "16.00",
  "DIMM_Count_Total": "32",
  "Domain_Policy_Active": "1",
  "FixPwrAverage": "1099.00",
  "FixPwrDGXAvg": "871.00",
  "FixPwrHGXAvg": "228.00",
  "GPU_PWR_BRAKE": "0",
  "GPU_PWR_PRSNT": "1",
  "PSU_Active_Policy": "4",
  "PSU_Redundancy_Policy": "0",
  "PSU_WORKING_CNT": "6",
  "coreEfficiency_0": "99906.00",
  "coreEfficiency_1": "214150.00",
  "cpuEnergy_0": "195.00",
  "cpuEnergy_1": "195.00",
  "cpuPackagePowerCapabilitiesMax_0": "350",
  "cpuPackagePowerCapabilitiesMax_1": "350",
  "cpuPackagePowerCapabilitiesMin_0": "209",
  "cpuPackagePowerCapabilitiesMin_1": "209",
  "cpuPackagePowerLimit1_0": "400.00",
  "cpuPackagePowerLimit1_1": "400.00",
  "cpuPackagePowerLimit2_0": "400.00",
  "cpuPackagePowerLimit2_1": "400.00",
  "cpuPackagePower_avg_0": "195.00",
  "cpuPackagePower_avg_1": "195.00",
  "dcPlatformEnergy": "2069.00",
  "dcPlatformPowerDGX_avg": "1321.00",
  "dcPlatformPowerHGX_avg": "753.00",
  "dcPlatformPowerLimit1": "0.00",
  "dcPlatformPowerLimit2": "0.00",
  "dcPlatformPower_avg": "2074.00",
  "dramEnergy_0": "28.00",
  "dramEnergy_1": "37.00",
  "dramPackagePowerCapabilitiesMax_0": "122.00",
  "dramPackagePowerCapabilitiesMax_1": "122.00",
  "dramPackagePowerCapabilitiesMin_0": "0.00",
  "dramPackagePowerCapabilitiesMin_1": "0.00",
  "dramPowerLimit_0": "300.00",
  "dramPowerLimit_1": "300.00",
  "dramPower_avg_0": "28.00",
  "dramPower_avg_1": "37.00",
  "gpuPowerCapabilitiesMax_0": "700.00",
  "gpuPowerCapabilitiesMax_1": "700.00",
  "gpuPowerCapabilitiesMax_2": "700.00",
  "gpuPowerCapabilitiesMax_3": "700.00",
  "gpuPowerCapabilitiesMax_4": "700.00",
  "gpuPowerCapabilitiesMax_5": "700.00",
  "gpuPowerCapabilitiesMax_6": "700.00",
  "gpuPowerCapabilitiesMax_7": "700.00",
  "gpuPowerCapabilitiesMin_0": "200.00",
  "gpuPowerCapabilitiesMin_1": "200.00",
  "gpuPowerCapabilitiesMin_2": "200.00",
  "gpuPowerCapabilitiesMin_3": "200.00",
  "gpuPowerCapabilitiesMin_4": "200.00",
  "gpuPowerCapabilitiesMin_5": "200.00",
  "gpuPowerCapabilitiesMin_6": "200.00",
  "gpuPowerCapabilitiesMin_7": "200.00",
  "gpuPowerLimit_0": "700.00",
  "gpuPowerLimit_1": "700.00",
  "gpuPowerLimit_2": "700.00",
  "gpuPowerLimit_3": "700.00",
  "gpuPowerLimit_4": "700.00",
  "gpuPowerLimit_5": "700.00",
  "gpuPowerLimit_6": "700.00",
  "gpuPowerLimit_7": "700.00",
  "gpuPower_avg_0": "64.00",
  "gpuPower_avg_1": "66.00",
  "gpuPower_avg_2": "65.00",
  "gpuPower_avg_3": "65.00",
  "gpuPower_avg_4": "64.00",
  "gpuPower_avg_5": "64.00",
  "gpuPower_avg_6": "70.00",
  "gpuPower_avg_7": "65.00",
  "prochotRatioCapabilitiesMax_0": "2000",
  "prochotRatioCapabilitiesMax_1": "2000",
  "prochotRatioCapabilitiesMin_0": "500",
  "prochotRatioCapabilitiesMin_1": "500",
  "turboRatioCapabilitiesMax_0": "3800",
  "turboRatioCapabilitiesMax_1": "3800",
  "turboRatioCapabilitiesMin_0": "500",
  "turboRatioCapabilitiesMin_1": "500"
}

| Field | Type | Description |
| --- | --- | --- |
| metrics | google.protobuf.Any | A map of telemetry & usage data for the node |
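Because the raw values arrive as strings and many keys carry a numeric suffix, a small post-processing step can make the map easier to work with. The Python sketch below is one possible approach under two assumptions inferred only from the example above: that a trailing `_N` on a key is a per-device index, and that most values parse as numbers. The helper name `group_indexed_metrics` is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a trailing "_<digits>" marks a per-device index (e.g. gpuPower_avg_3).
_INDEXED = re.compile(r"^(?P<name>.+)_(?P<index>\d+)$")


def _to_number(value):
    """Convert a raw string to float, leaving non-numeric values untouched."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return value


def group_indexed_metrics(raw):
    """Split a raw metrics map into scalar values and per-device series."""
    scalars = {}
    series = defaultdict(dict)
    for key, value in raw.items():
        match = _INDEXED.match(key)
        if match:
            series[match["name"]][int(match["index"])] = _to_number(value)
        else:
            scalars[key] = _to_number(value)
    return scalars, dict(series)


raw = {
    "AvblNoGPU": "8",
    "dcPlatformPowerLimit1": "0.00",
    "gpuPower_avg_0": "64.00",
    "gpuPower_avg_1": "66.00",
}
scalars, series = group_indexed_metrics(raw)
print(scalars)
print(series)
```

Keys such as dcPlatformPowerLimit1 have no underscore before the trailing digit, so they stay in the scalar map. Since raw metrics may differ by device, the suffix rule should be checked against the metrics definitions for the target hardware.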

NodeMetricsResponse

NodeMetricsResponse contains aggregated metrics for all components on a single node.

| Field | Type | Description |
| --- | --- | --- |
| name | string | Name of the node |
| num_gpus | uint32 | Number of GPUs present on the node |
| num_cpus | uint32 | Number of CPUs present on the node |
| num_memory_units | uint32 | Number of memory units (DIMMs) present on the node |
| power_usage | double | none |
| gpus | repeated NodeMetricsResponse.ComponentUsage | List of GPU usage metrics for each GPU on the node |
| cpus | repeated NodeMetricsResponse.ComponentUsage | List of CPU usage metrics for each CPU on the node |
| memory | repeated NodeMetricsResponse.ComponentUsage | List of memory usage metrics for each memory unit on the node |

NodeMetricsResponse.ComponentUsage

ComponentUsage contains power and thermal metrics for a single component (GPU, CPU, or memory unit). Where a metric is not available, its value is zero.

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier for the component on the node (typically 0-based) |
| power_usage | double | Current power consumption in watts |
| energy_usage | double | Cumulative energy consumption since last reset |
| energy_usage_unit | string | Unit for energy consumption (e.g., “Joules”) |
| set_limit | double | Currently set power limit in watts |
| max_limit | double | Maximum allowed power limit in watts |
| min_limit | double | Minimum allowed power limit in watts |
| temperature_celsius | double | Current temperature in Celsius |
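Since unavailable metrics are reported as zero, client code may want to tell "missing" apart from a usable reading before doing arithmetic. The Python sketch below shows one way to do that. The dataclass only mirrors the field names in the table and is not a generated protobuf class, and `power_headroom` is a hypothetical helper. Treating zero as missing is an assumption that cannot distinguish a true zero reading from an absent one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComponentUsage:
    """Plain stand-in for NodeMetricsResponse.ComponentUsage (illustration only)."""
    id: str
    power_usage: float = 0.0
    energy_usage: float = 0.0
    energy_usage_unit: str = ""
    set_limit: float = 0.0
    max_limit: float = 0.0
    min_limit: float = 0.0
    temperature_celsius: float = 0.0


def power_headroom(component: ComponentUsage) -> Optional[float]:
    """Return watts remaining under the set limit, or None if either input is missing."""
    # Zero is treated as "not available", following the ComponentUsage description.
    if component.set_limit == 0.0 or component.power_usage == 0.0:
        return None
    return component.set_limit - component.power_usage


gpu = ComponentUsage(id="0", power_usage=64.0, set_limit=700.0)
unknown = ComponentUsage(id="1")

print(power_headroom(gpu))      # 636.0
print(power_headroom(unknown))  # None
```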

Scalar Value Types

| .proto Type | Notes | C++ Type | Java Type | Python Type |
| --- | --- | --- | --- | --- |
| double | | double | double | float |
| float | | float | float | float |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long |
| uint32 | Uses variable-length encoding. | uint32 | int | int/long |
| uint64 | Uses variable-length encoding. | uint64 | long | int/long |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long |
| sfixed32 | Always four bytes. | int32 | int | int |
| sfixed64 | Always eight bytes. | int64 | long | int/long |
| bool | | bool | boolean | boolean |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode |
| bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str |
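The notes on variable-length encoding can be made concrete with a short example. The following Python sketch implements protobuf's base-128 varint and 32-bit zigzag mappings by hand, for illustration only. It is not taken from any protobuf library, and the helper names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("expects a non-negative integer")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def zigzag32(value: int) -> int:
    """Map a signed 32-bit integer to an unsigned one, as sint32 does before encoding."""
    return ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF


# A small uint32 such as a GPU ID fits in one byte.
print(len(encode_varint(7)))                         # 1
# Values up to 2^28 - 1 need four bytes; from 2^28 on, a varint needs five,
# which is why fixed32 (always four bytes) becomes the cheaper choice.
print(len(encode_varint(2**28 - 1)))                 # 4
print(len(encode_varint(2**28)))                     # 5
# A negative int32 is sign-extended to 64 bits, so -1 costs ten bytes...
print(len(encode_varint(-1 & 0xFFFFFFFFFFFFFFFF)))   # 10
# ...while sint32 zigzag-maps -1 to 1, which costs a single byte.
print(len(encode_varint(zigzag32(-1))))              # 1
```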