bmc_health
API Reference: v1/bmc_health.proto
BMC health check service messages.
Defines requests and responses used by the deep BMC pre-requisite health check, which validates Redfish endpoint reachability, firmware inventory, power setting controls, IB/OOB drift, EDPp/WPPS state, and telemetry across cluster nodes.
Table of Contents
- Messages
  - BMCHealthCheckHandle
  - BMCHealthCheckProgress
  - BMCHealthCheckReportRequest
  - BMCHealthCheckReportResponse
  - BMCHealthCheckRequest
  - BMCHealthCheckResponse
  - BMCHealthCheckStatus
  - BMCHealthCheckStatusRequest
  - ClusterSummary
  - DurationStats
  - EDPpRead
  - EndpointStat
  - FirmwareEntry
  - IBOOBDrift
  - Issue
  - NodeReport
  - PowerSettingRead
  - PowerWriteCycleStat
  - TelemetryComponentStat
  - ValueStats
  - WPPSRead
- Enums
  - Severity
Messages
BMCHealthCheckHandle
BMCHealthCheckHandle is returned immediately from StartBMCHealthCheck. The task continues running in the background on the server; callers poll GetBMCHealthCheckStatus and GetBMCHealthCheckReport using task_id to observe progress and retrieve the final report.
task_id is stable for the lifetime of the task in the server’s task manager. After the task completes, it remains queryable for at most --zombie-lifetime (default 10m); after that, a status or report lookup returns NotFound. A server restart drops all task state, so callers must start a new health check after a restart.
| Field | Type | Description |
|---|---|---|
| task_id | string | Unique task identifier; pass back to status / report RPCs. |
| name | string | Human-readable task name (e.g. “bmc_health_check”). |
| status | string | Initial status string from the task manager (e.g. “Started”). |
| started_at | google.protobuf.Timestamp | Server-side timestamp when the task was started. |
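
The start-then-poll flow described above can be sketched as follows. This is a minimal illustration, not the actual client: `Status`, `TaskNotFound`, and the `get_status` callable are hypothetical stand-ins for the generated BMCHealthCheckStatus message, a NotFound RPC error, and a GetBMCHealthCheckStatus call, and the poll interval and timeout values are arbitrary assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Status:
    """Hypothetical stand-in for BMCHealthCheckStatus (subset of fields)."""
    task_id: str
    status: str
    done: bool
    error_message: str = ""


class TaskNotFound(Exception):
    """Hypothetical stand-in for a NotFound error from the status RPC."""


def wait_for_task(
    get_status: Callable[[str], Status],
    task_id: str,
    poll_interval_s: float = 2.0,   # assumed value, not a documented default
    timeout_s: float = 1800.0,      # assumed value, not a documented default
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Status]:
    """Poll until the task reports done.

    Returns None when the server no longer knows the task, which can happen
    after the zombie lifetime has elapsed or after a server restart.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            status = get_status(task_id)
        except TaskNotFound:
            return None
        if status.done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish in {timeout_s}s")
        sleep(poll_interval_s)
```

A None result means the caller has to start a new health check, since the previous task state is gone.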
BMCHealthCheckProgress
BMCHealthCheckProgress reports how far the runner has advanced through the in-scope entities. Both counters reset to zero between runs; they are not durable.
| Field | Type | Description |
|---|---|---|
| nodes_total | int32 | Total entities to probe; populated once the task has loaded its entity set from the topology store (zero until then). |
| nodes_completed | int32 | Entities that have produced a response (real, synthesized timeout, or dispatch error). |
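
Because nodes_total stays at zero until the entity set has been loaded, a client that renders a percentage needs to treat that case separately. A minimal sketch, where the helper name `progress_fraction` is hypothetical:

```python
from typing import Optional


def progress_fraction(nodes_total: int, nodes_completed: int) -> Optional[float]:
    """Return completion as a value in [0.0, 1.0], or None if not yet known.

    nodes_total == 0 means the task has not loaded its entity set yet, so no
    meaningful ratio exists.
    """
    if nodes_total <= 0:
        return None
    # Clamp defensively so a display never shows more than 100%.
    return min(nodes_completed, nodes_total) / nodes_total
```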
BMCHealthCheckReportRequest
BMCHealthCheckReportRequest fetches the final report. If the task is still running, the response carries done=false and no payload; callers should continue polling (or use the dpsctl --wait shortcut).
| Field | Type | Description |
|---|---|---|
| task_id | string | Task ID returned by StartBMCHealthCheck. |
| summary_only | bool | When true, the server strips per-node NodeReport entries from the response and returns only Status, ClusterSummary, and Issues. Useful on large clusters where the full per-node detail (firmware lists, endpoint stats, telemetry samples) inflates the payload past what an operator wants to scroll through, and for scripted polling where only the pass/fail summary matters. The underlying task is unaffected; a subsequent call with summary_only=false returns the full report (until the task exits the zombie window). |
BMCHealthCheckReportResponse
BMCHealthCheckReportResponse is the response of GetBMCHealthCheckReport.
| Field | Type | Description |
|---|---|---|
| done | bool | True when the underlying task is finished (success or failure). When false the report field is unset; poll again later. |
| status | string | Task manager status string at the time of the lookup (“Started” while running, “Done” / “Failed” once finished). |
| report | BMCHealthCheckResponse | Cluster-level result. Populated only when done=true and the task did not fail before any data could be collected. |
| error_message | string | Human-readable error string from the task’s terminal error, if any. Empty on a clean run. |
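
The field semantics above imply a small decision table on the client side. The sketch below is one possible reading, not the server's logic: `ReportResponse` is a hypothetical stand-in for the generated message (with the report reduced to an opaque object), and the returned labels are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReportResponse:
    """Hypothetical stand-in for BMCHealthCheckReportResponse."""
    done: bool
    status: str
    report: Optional[object] = None
    error_message: str = ""


def classify_report(resp: ReportResponse) -> str:
    """Map a response onto one of four illustrative client-side states."""
    if not resp.done:
        return "pending"              # report is unset; poll again later
    if resp.report is None:
        return "failed_no_data"       # task failed before collecting data
    if resp.error_message:
        return "finished_with_error"  # data collected, terminal error recorded
    return "finished"                 # clean run
```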
BMCHealthCheckRequest
BMCHealthCheckRequest is the request for running a deep BMC pre-requisite health check.
Concurrency note: the server enforces an at-most-one gate that is SERVER-WIDE, not topology- or node-scoped. While any BMC health check is in flight, a subsequent StartBMCHealthCheck for a different topology (or a disjoint Nodes subset) is rejected with AlreadyExists and the existing task_id. This is the intended first-version behavior because in practice topologies on the same server have a high node intersection and serializing protects shared BMCs from overlapping load and write-cycle interleaving. A future revision may loosen this to allow disjoint runs to proceed in parallel.
| Field | Type | Description |
|---|---|---|
| topology_name | string | Name of the topology whose nodes are to be checked. |
| nodes | repeated string | Optional explicit list of node names to check. If empty, all nodes in the topology are checked. |
| samples_per_telemetry | optional int32 | Optional number of telemetry samples to collect per node. |
| concurrency | optional int32 | Optional maximum number of nodes to check concurrently. |
| skip_writes | optional bool | Optional flag to skip write/restore operations on BMCs. |
| expected_edpp_pct | optional int32 | Expected steady-state EDPp current value, in percent. The screen rule flags a processor whose EDPp current_pct is below this floor. Default (when unset) is 100. Allowed range when set: 1..200. |
| telemetry_interval_ms | optional int32 | Sleep between telemetry samples on each node, in milliseconds. Zero means no inter-sample delay (samples issued back-to-back). Allowed range when set: 0..60_000 (1 minute). The dpsctl CLI defaults to 500ms; direct API callers that omit the field get the server-side default (no delay, matching the legacy back-to-back semantics). |
| write_resolution_timeout_ms | optional int32 | Per-PATCH readback window for the write-cycle probe, in milliseconds. After each PATCH on PowerLimitWatts.SetPoint the probe polls the readback on a small backoff schedule until the value matches or this ceiling elapses. Zero (or unset) uses the bmcprobe default (1s). Operators on very slow BMC fleets can raise this, but values much above 5s tend to amplify failures on already-stuck BMCs rather than recover them. Allowed range when set: 0..60_000 (60s). |
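
A client can check the documented ranges before sending a request. The following is a minimal sketch of such a pre-flight check; the function name and the error strings are hypothetical, and the server's own validation may differ in detail. None models an unset optional field.

```python
from typing import List, Optional


def validate_request(
    expected_edpp_pct: Optional[int] = None,
    telemetry_interval_ms: Optional[int] = None,
    write_resolution_timeout_ms: Optional[int] = None,
) -> List[str]:
    """Return a list of range violations; an empty list means the values fit."""
    errors: List[str] = []
    # expected_edpp_pct: allowed range when set is 1..200.
    if expected_edpp_pct is not None and not 1 <= expected_edpp_pct <= 200:
        errors.append("expected_edpp_pct must be in 1..200")
    # telemetry_interval_ms: allowed range when set is 0..60_000.
    if telemetry_interval_ms is not None and not 0 <= telemetry_interval_ms <= 60_000:
        errors.append("telemetry_interval_ms must be in 0..60000")
    # write_resolution_timeout_ms: allowed range when set is 0..60_000.
    if (
        write_resolution_timeout_ms is not None
        and not 0 <= write_resolution_timeout_ms <= 60_000
    ):
        errors.append("write_resolution_timeout_ms must be in 0..60000")
    return errors
```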
BMCHealthCheckResponse
BMCHealthCheckResponse contains the results of a BMC pre-requisite health check.
| Field | Type | Description |
|---|---|---|
| status | Status | none |
| node_reports | repeated NodeReport | none |
| cluster_summary | ClusterSummary | none |
| issues | repeated Issue | none |
BMCHealthCheckStatus
BMCHealthCheckStatus is the response of GetBMCHealthCheckStatus.
| Field | Type | Description |
|---|---|---|
| task_id | string | Echoes the looked-up task_id. |
| name | string | Task name (constant for the lifetime of the task). |
| status | string | Task manager status string: “Init”, “Started”, “Done”, “Failed”. |
| started_at | google.protobuf.Timestamp | Time the task was started. |
| completed_at | google.protobuf.Timestamp | Time the task completed; unset while running. |
| diag_message | string | Diagnostic message when the task panicked. Empty otherwise. |
| progress | BMCHealthCheckProgress | Coarse progress counters. Available even before completion. |
| done | bool | True when the task has reached a terminal state (success, failure, or cancellation) and the report is ready to fetch via GetBMCHealthCheckReport. |
| error_message | string | Human-readable error string from the task’s terminal error, if any (e.g. a loadData failure). Empty when no error was recorded. |
BMCHealthCheckStatusRequest
BMCHealthCheckStatusRequest queries the current status of a previously started BMC health check.
| Field | Type | Description |
|---|---|---|
| task_id | string | Task ID returned by StartBMCHealthCheck. |
ClusterSummary
ClusterSummary aggregates pass/fail counts and latency statistics across the cluster.
| Field | Type | Description |
|---|---|---|
| total_nodes | int64 | Nodes in the topology scope at run start (the full set of entities the task intended to probe). Populated by the aggregator from its input view count, independent of how many entries actually carried a usable NodeReport. |
| passed | int64 | none |
| failed | int64 | none |
| endpoint_latency_ms | DurationStats | none |
| telemetry_latency_ms | DurationStats | none |
| power_get_latency_ms | DurationStats | Aggregate latency, in milliseconds, for power-limit GETs across all nodes/processors. |
| power_set_latency_ms | DurationStats | Aggregate latency, in milliseconds, for power-limit SETs across all nodes/processors. |
| edpp_get_latency_ms | DurationStats | Aggregate latency, in milliseconds, for EDPp percent reads across all nodes/processors. |
| wpps_get_latency_ms | DurationStats | Aggregate latency, in milliseconds, for WPPS endpoint reads across all nodes. |
| nodes_attempted | int32 | Nodes the runner actually invoked a probe against, i.e. entries that reached the screening pass with a non-nil NodeReport. May be smaller than total_nodes if the aggregator dropped malformed entries. |
| nodes_unreachable | int32 | Nodes for which probing failed before any sample was collected (BMC unreachable, auth, etc.). |
DurationStats
DurationStats captures latency statistics, in milliseconds.
| Field | Type | Description |
|---|---|---|
| count | int64 | none |
| min_ms | int64 | none |
| max_ms | int64 | none |
| avg_ms | int64 | none |
| p50_ms | int64 | none |
| p95_ms | int64 | none |
| p99_ms | int64 | none |
EDPpRead
EDPpRead captures the EDPp reading for a processor.
| Field | Type | Description |
|---|---|---|
| processor | string | none |
| current_pct | int32 | none |
| reference_pct | int32 | none |
| reference_available | bool | none |
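
The expected_edpp_pct field of BMCHealthCheckRequest describes a screen rule that flags a processor whose current_pct is below the expected floor (100 when unset). A minimal sketch of that comparison, assuming nothing beyond that description; the helper names are hypothetical and the real screen pass may consider additional fields such as reference_pct:

```python
from typing import Iterable, List, Optional, Tuple

DEFAULT_EXPECTED_EDPP_PCT = 100  # documented default when the field is unset


def edpp_below_floor(current_pct: int, expected_edpp_pct: Optional[int] = None) -> bool:
    """True when current_pct is strictly below the expected floor."""
    floor = DEFAULT_EXPECTED_EDPP_PCT if expected_edpp_pct is None else expected_edpp_pct
    return current_pct < floor


def flagged_processors(
    reads: Iterable[Tuple[str, int]],
    expected_edpp_pct: Optional[int] = None,
) -> List[str]:
    """Return the processor names from (processor, current_pct) pairs that are flagged."""
    return [proc for proc, pct in reads if edpp_below_floor(pct, expected_edpp_pct)]
```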
EndpointStat
EndpointStat records request statistics for a single Redfish endpoint+method pair.
| Field | Type | Description |
|---|---|---|
| endpoint | string | none |
| method | string | none |
| attempts | int64 | none |
| successes | int64 | none |
| failures | int64 | none |
| duration | DurationStats | none |
FirmwareEntry
FirmwareEntry describes a single firmware inventory entry from the BMC.
The node attribution is denormalized onto every entry (rather than only living on the enclosing NodeReport) so the firmware list is usable when flattened across the cluster – e.g. when correlating firmware versions against failing-node Issues, or when emitting a per-node firmware summary into structured logs without re-walking the parent report.
| Field | Type | Description |
|---|---|---|
| id | string | Inventory ID as reported by the BMC (e.g. “BMC_0”, “HMC_0”, “ERoT_BMC_0”). Stable across probe runs. |
| version | string | Version string returned by the BMC. Empty when the inventory item existed but its version field could not be parsed. |
| node | string | Node name the firmware was read from. Always set by probeFirmware so consumers can filter the cluster’s firmware list by node without joining back to the parent NodeReport. |
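
Because every entry carries its node name, the firmware lists from all NodeReports can be concatenated and grouped without reference to the parent report. The sketch below illustrates one such use, finding inventory IDs that report more than one version across the cluster; the `FirmwareEntry` dataclass is a hypothetical stand-in for the generated message and the helper names are invented.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class FirmwareEntry:
    """Hypothetical stand-in for the FirmwareEntry message."""
    id: str
    version: str
    node: str


def versions_by_id(entries: Iterable[FirmwareEntry]) -> Dict[str, Dict[str, Set[str]]]:
    """Group a flattened firmware list as id -> version -> set of node names."""
    grouped: Dict[str, Dict[str, Set[str]]] = {}
    for entry in entries:
        grouped.setdefault(entry.id, {}).setdefault(entry.version, set()).add(entry.node)
    return grouped


def mixed_version_ids(entries: Iterable[FirmwareEntry]) -> List[str]:
    """Return the inventory IDs that appear with more than one version, sorted."""
    grouped = versions_by_id(entries)
    return sorted(fw_id for fw_id, by_version in grouped.items() if len(by_version) > 1)
```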
IBOOBDrift
IBOOBDrift captures the drift between in-band and out-of-band power values.
| Field | Type | Description |
|---|---|---|
| component | string | none |
| set_point_w | int64 | none |
| oneshot_w | int64 | none |
| drift | bool | none |
Issue
Issue describes a single problem detected during the health check.
| Field | Type | Description |
|---|---|---|
| severity | Severity | none |
| code | string | none |
| node | string | none |
| resource | string | none |
| observed | string | none |
| threshold | string | none |
| message | string | none |
NodeReport
NodeReport summarizes the BMC pre-requisite health check results for a single node.
| Field | Type | Description |
|---|---|---|
| node | string | none |
| reachable | bool | none |
| error_msg | string | none |
| firmware | repeated FirmwareEntry | none |
| endpoint_stats | repeated EndpointStat | none |
| power_settings | repeated PowerSettingRead | none |
| power_writes | repeated PowerWriteCycleStat | none |
| ib_oob_drift | repeated IBOOBDrift | none |
| edpp | repeated EDPpRead | none |
| wpps | repeated WPPSRead | none |
| telemetry_samples | int64 | none |
| telemetry | repeated TelemetryComponentStat | none |
PowerSettingRead
PowerSettingRead captures the readable power setting state for a component.
| Field | Type | Description |
|---|---|---|
| component | string | none |
| set_point_w | int64 | none |
| allowable_max_w | int64 | none |
| default_set_point_w | int64 | none |
| power_w | double | none |
PowerWriteCycleStat
PowerWriteCycleStat captures the results of an exercised power write/restore cycle.
| Field | Type | Description |
|---|---|---|
| component | string | none |
| original_w | int64 | none |
| decrement_cycles | int32 | Number of decrement-by-1W iterations the probe attempted. |
| mismatches | int32 | Total readback / set mismatches observed across all decrement and increment iterations plus the final restore readback. |
| restored | bool | none |
| restored_value_w | int64 | none |
| restore_error | string | none |
| increment_cycles | int32 | Number of increment-by-1W iterations the probe attempted as part of walking the SetPoint back up to the captured original. |
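
One way a consumer might interpret a write-cycle record is sketched below. Several fields in this message are undocumented, so the sketch rests on explicit assumptions: that restored reports whether the final restore succeeded, and that restored_value_w is the readback after the restore and should equal original_w. The dataclass is a hypothetical stand-in for the generated message, and the problem strings are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PowerWriteCycleStat:
    """Hypothetical stand-in for the PowerWriteCycleStat message."""
    component: str
    original_w: int
    decrement_cycles: int
    increment_cycles: int
    mismatches: int
    restored: bool
    restored_value_w: int
    restore_error: str = ""


def write_cycle_problems(stat: PowerWriteCycleStat) -> List[str]:
    """List illustrative problems found in one write/restore cycle record."""
    problems: List[str] = []
    if stat.mismatches > 0:
        problems.append(f"{stat.component}: {stat.mismatches} readback mismatch(es)")
    if not stat.restored:
        # Assumption: restore_error explains a failed restore when present.
        detail = stat.restore_error or "no error detail"
        problems.append(f"{stat.component}: restore failed ({detail})")
    elif stat.restored_value_w != stat.original_w:
        # Assumption: a successful restore should read back the original value.
        problems.append(
            f"{stat.component}: restored to {stat.restored_value_w} W, "
            f"expected {stat.original_w} W"
        )
    return problems
```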
TelemetryComponentStat
TelemetryComponentStat summarizes telemetry samples for a single component.
| Field | Type | Description |
|---|---|---|
| component | string | none |
| power_w | ValueStats | none |
ValueStats
ValueStats captures statistics over a numeric value (e.g. telemetry power).
| Field | Type | Description |
|---|---|---|
| count | int64 | none |
| min | double | none |
| max | double | none |
| mean | double | none |
| p50 | double | none |
| p95 | double | none |
| p99 | double | none |
WPPSRead
WPPSRead captures the workload power profile state for a processor.
| Field | Type | Description |
|---|---|---|
| processor | string | none |
| accessible | bool | none |
| enforced_mask | string | none |
| requested_mask | string | none |
| supported_mask | string | none |
| available_profile_ids | repeated int32 | none |
| error_msg | string | none |
| was_active | bool | True when EnforcedProfileMask was non-zero on probe – i.e. one or more workload power profiles were active on this processor at probe time. Lights-on validation requires no active profiles; the screen pass converts this into a WPPS_PROFILES_ACTIVE Issue. |
| active_profile_ids | repeated int32 | The profile IDs decoded from the original EnforcedProfileMask, in ascending order. Empty when was_active is false. Captured before any remediation so operators always see exactly what was active at probe time, even if the reset succeeded. |
| reset_attempted | bool | True when the probe attempted to disable the active profiles. False when no profiles were active or when --skip-writes was set. |
| reset_succeeded | bool | True when the post-reset re-read confirmed EnforcedProfileMask is now zero. Meaningful only when reset_attempted is true. |
| reset_error | string | Diagnostic message when reset_attempted is true and reset_succeeded is false. Empty otherwise. |
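
The three booleans was_active, reset_attempted, and reset_succeeded combine into a small number of meaningful states. The sketch below spells them out; the function name and the state labels are hypothetical and chosen only for illustration.

```python
def wpps_reset_state(was_active: bool, reset_attempted: bool, reset_succeeded: bool) -> str:
    """Collapse the WPPSRead remediation flags into one illustrative label."""
    if not was_active:
        # No profiles were enforced at probe time, so nothing needed a reset.
        return "inactive"
    if not reset_attempted:
        # Profiles were active but the probe did not try to disable them,
        # for example because writes were skipped.
        return "active_not_reset"
    # reset_succeeded is meaningful only once a reset was attempted.
    return "reset_ok" if reset_succeeded else "reset_failed"
```

Note that the original finding is preserved in every state: active_profile_ids is captured before remediation, so a "reset_ok" record still shows which profiles were active at probe time.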
Enums
Severity
Severity classifies the impact level of an Issue raised during the health check.
| Name | Number | Description |
|---|---|---|
| SEVERITY_UNSPECIFIED | 0 | none |
| SEVERITY_WARNING | 1 | none |
| SEVERITY_ERROR | 2 | none |
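
For scripted consumers, the enum numbers can be mirrored locally to reduce a list of Issues to a single worst severity. This is a sketch under the assumption that a higher number means a more severe Issue, which the names suggest but the reference does not state; `worst_severity` is a hypothetical helper.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    """Local mirror of the Severity enum values listed above."""
    SEVERITY_UNSPECIFIED = 0
    SEVERITY_WARNING = 1
    SEVERITY_ERROR = 2


def worst_severity(severities: Iterable[Severity]) -> Severity:
    """Return the highest severity seen, or SEVERITY_UNSPECIFIED for no issues."""
    # Assumption: numeric order matches impact order.
    return max(severities, default=Severity.SEVERITY_UNSPECIFIED)
```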