bmc_health

API Reference: v1/bmc_health.proto

BMC health check service messages.

Defines requests and responses used by the deep BMC pre-requisite health check, which validates Redfish endpoint reachability, firmware inventory, power setting controls, IB/OOB drift, EDPp/WPPS state, and telemetry across cluster nodes.

Messages

BMCHealthCheckHandle

BMCHealthCheckHandle is returned immediately from StartBMCHealthCheck. The task continues running in the background on the server; callers poll GetBMCHealthCheckStatus and GetBMCHealthCheckReport using task_id to observe progress and retrieve the final report.

task_id is stable for the lifetime of the task in the server’s task manager. After the task completes, it remains queryable for at most --zombie-lifetime (default 10m); after that, a status or report lookup returns NotFound. A server restart drops all task state, so callers must start a new check after a restart.

Field Type Description
task_id string Unique task identifier; pass back to status / report RPCs.
name string Human-readable task name (e.g. “bmc_health_check”).
status string Initial status string from the task manager (e.g. “Started”).
started_at google.protobuf.Timestamp Server-side timestamp when the task was started.
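
The start-then-poll flow described above can be sketched roughly as follows. This is a minimal sketch, not the service's client library: the `client` object and its method names (`start_bmc_health_check`, `get_bmc_health_check_status`, `get_bmc_health_check_report`) are hypothetical stand-ins for a stub generated from v1/bmc_health.proto, and `FakeClient` exists only so the example runs without a server.

```python
import time
from dataclasses import dataclass


@dataclass
class Status:
    task_id: str
    status: str
    done: bool


class FakeClient:
    """Hypothetical in-memory stand-in for a generated gRPC stub."""

    def __init__(self, polls_until_done: int = 3):
        self._remaining = polls_until_done

    def start_bmc_health_check(self, topology_name: str) -> str:
        return "task-1"  # the real RPC returns a BMCHealthCheckHandle

    def get_bmc_health_check_status(self, task_id: str) -> Status:
        self._remaining -= 1
        done = self._remaining <= 0
        return Status(task_id, "Done" if done else "Started", done)

    def get_bmc_health_check_report(self, task_id: str) -> dict:
        return {"done": True, "status": "Done", "error_message": ""}


def run_and_wait(client, topology_name, poll_interval_s=1.0, timeout_s=600.0):
    """Start a check, poll status until done, then fetch the report."""
    task_id = client.start_bmc_health_check(topology_name)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if client.get_bmc_health_check_status(task_id).done:
            # Fetch promptly: a finished task is only queryable for a limited time.
            return client.get_bmc_health_check_report(task_id)
        time.sleep(poll_interval_s)
    raise TimeoutError(f"task {task_id} did not finish within {timeout_s}s")


report = run_and_wait(FakeClient(), "example-topology", poll_interval_s=0.01)
```

A caller that must survive server restarts would likely wrap `run_and_wait` in its own retry, since a NotFound on an in-flight task_id means the task state is gone and the check has to be started again.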

BMCHealthCheckProgress

BMCHealthCheckProgress reports how far the runner has advanced through the in-scope entities. Both counters reset to zero between runs; they are not durable.

Field Type Description
nodes_total int32 Total entities to probe; populated once the task has loaded its entity set from the topology store (zero until then).
nodes_completed int32 Entities that have produced a response (real, synthesized timeout, or dispatch error).
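
Because nodes_total stays at zero until the entity set is loaded, a progress display has to treat zero as "unknown" rather than divide by it. A minimal sketch (the helper name `progress_fraction` is an assumption, not part of the API):

```python
from typing import Optional


def progress_fraction(nodes_total: int, nodes_completed: int) -> Optional[float]:
    """Return completion in [0.0, 1.0], or None while the entity set is unloaded."""
    if nodes_total <= 0:
        return None  # nodes_total is zero until the topology store has been read
    return min(nodes_completed / nodes_total, 1.0)
```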

BMCHealthCheckReportRequest

BMCHealthCheckReportRequest fetches the final report. If the task is still running, the response has done=false and no report payload; callers should continue polling (or use the dpsctl --wait shortcut).

Field Type Description
task_id string Task ID returned by StartBMCHealthCheck.
summary_only bool When true, the server strips per-node NodeReport entries from the response and returns only Status, ClusterSummary, and Issues. Useful on large clusters where the full per-node detail (firmware lists, endpoint stats, telemetry samples) inflates the payload past what an operator wants to scroll through, and for scripted polling where only the pass/fail summary matters. The underlying task is unaffected; a subsequent call with summary_only=false returns the full report (until the task exits the zombie window).

BMCHealthCheckReportResponse

BMCHealthCheckReportResponse is the response of GetBMCHealthCheckReport.

Field Type Description
done bool True when the underlying task is finished (success or failure). When false the report field is unset; poll again later.
status string Task manager status string at the time of the lookup (“Started” while running, “Done” / “Failed” once finished).
report BMCHealthCheckResponse Cluster-level result. Populated only when done=true and the task did not fail before any data could be collected.
error_message string Human-readable error string from the task’s terminal error, if any. Empty on a clean run.
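
The field combinations above imply a small decision table for callers. The sketch below is one possible reading of it; the function name and the returned labels are illustrative assumptions, and the report is modelled as an optional dict rather than a generated message.

```python
from typing import Optional


def classify_report(done: bool, report: Optional[dict], error_message: str) -> str:
    """Map a report response onto a coarse outcome label (labels are hypothetical)."""
    if not done:
        return "pending"  # report is unset; poll again later
    if report is None:
        return "failed_no_data"  # task failed before any data was collected
    if error_message:
        return "finished_with_error"  # some data exists alongside a terminal error
    return "finished"
```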

BMCHealthCheckRequest

BMCHealthCheckRequest is the request for running a deep BMC pre-requisite health check.

Concurrency note: the server enforces an at-most-one gate that is SERVER-WIDE, not topology- or node-scoped. While any BMC health check is in flight, a subsequent StartBMCHealthCheck for a different topology (or a disjoint Nodes subset) is rejected with AlreadyExists and the existing task_id. This is the intended first-version behavior because in practice topologies on the same server have a high node intersection and serializing protects shared BMCs from overlapping load and write-cycle interleaving. A future revision may loosen this to allow disjoint runs to proceed in parallel.
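
As an illustration of the gate semantics only, the toy model below shows why a request for a different topology is still rejected while any check is in flight. It is not the server's implementation; the class, method names, and task_id format are assumptions made for the sketch.

```python
import threading
import uuid
from typing import Optional, Tuple


class ServerWideGate:
    """Toy model of an at-most-one gate shared by every topology."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active_task_id: Optional[str] = None

    def try_start(self, topology_name: str) -> Tuple[bool, str]:
        """Return (accepted, task_id); on rejection task_id is the in-flight one."""
        with self._lock:
            if self._active_task_id is not None:
                return False, self._active_task_id  # corresponds to AlreadyExists
            self._active_task_id = f"{topology_name}-{uuid.uuid4().hex[:8]}"
            return True, self._active_task_id

    def finish(self) -> None:
        """Release the gate when the in-flight check ends."""
        with self._lock:
            self._active_task_id = None


gate = ServerWideGate()
ok_a, task_a = gate.try_start("topo-a")
ok_b, task_b = gate.try_start("topo-b")  # different topology, still rejected
gate.finish()
ok_c, task_c = gate.try_start("topo-b")  # accepted once the first run has ended
```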

Field Type Description
topology_name string Name of the topology whose nodes are to be checked.
nodes repeated string Optional explicit list of node names to check. If empty, all nodes in the topology are checked.
samples_per_telemetry optional int32 Optional number of telemetry samples to collect per node.
concurrency optional int32 Optional maximum number of nodes to check concurrently.
skip_writes optional bool Optional flag to skip write/restore operations on BMCs.
expected_edpp_pct optional int32 Expected steady-state EDPp current value, in percent. The screen rule flags a processor whose EDPp current_pct is below this floor. Default (when unset) is 100. Allowed range when set: 1..200.
telemetry_interval_ms optional int32 Sleep between telemetry samples on each node, in milliseconds. Zero means no inter-sample delay (samples issued back-to-back). Allowed range when set: 0..60_000 (1 minute). The dpsctl CLI defaults to 500ms; direct API callers that omit the field get the server-side default (no delay, matching the legacy back-to-back semantics).
write_resolution_timeout_ms optional int32 Per-PATCH readback window for the write-cycle probe, in milliseconds. After each PATCH on PowerLimitWatts.SetPoint the probe polls the readback on a small backoff schedule until the value matches or this ceiling elapses. Zero (or unset) uses the bmcprobe default (1s). Operators on very slow BMC fleets can raise this, but values much above 5s tend to amplify failures on already-stuck BMCs rather than recover them. Allowed range when set: 0..60_000 (60s).
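
The allowed ranges listed for the optional fields can be checked on the client before a request is sent. A minimal sketch, assuming unset fields are represented as None; the function name and error wording are illustrative, and the server remains the authority on validation.

```python
from typing import List, Optional


def validate_request(
    expected_edpp_pct: Optional[int] = None,
    telemetry_interval_ms: Optional[int] = None,
    write_resolution_timeout_ms: Optional[int] = None,
) -> List[str]:
    """Return a list of range violations; unset (None) fields are always valid."""
    limits = {
        "expected_edpp_pct": (expected_edpp_pct, 1, 200),
        "telemetry_interval_ms": (telemetry_interval_ms, 0, 60_000),
        "write_resolution_timeout_ms": (write_resolution_timeout_ms, 0, 60_000),
    }
    errors = []
    for name, (value, low, high) in limits.items():
        if value is not None and not low <= value <= high:
            errors.append(f"{name}={value} is outside {low}..{high}")
    return errors
```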

BMCHealthCheckResponse

BMCHealthCheckResponse contains the results of a BMC pre-requisite health check.

Field Type Description
status Status none
node_reports repeated NodeReport none
cluster_summary ClusterSummary none
issues repeated Issue none

BMCHealthCheckStatus

BMCHealthCheckStatus is the response of GetBMCHealthCheckStatus.

Field Type Description
task_id string Echoes the looked-up task_id.
name string Task name (constant for the lifetime of the task).
status string Task manager status string: “Init”, “Started”, “Done”, “Failed”.
started_at google.protobuf.Timestamp Time the task was started.
completed_at google.protobuf.Timestamp Time the task completed; unset while running.
diag_message string Diagnostic message when the task panicked. Empty otherwise.
progress BMCHealthCheckProgress Coarse progress counters. Available even before completion.
done bool True when the task has reached a terminal state (success, failure, or cancellation) and the report is ready to fetch via GetBMCHealthCheckReport.
error_message string Human-readable error string from the task’s terminal error, if any (e.g. a loadData failure). Empty when no error was recorded.

BMCHealthCheckStatusRequest

BMCHealthCheckStatusRequest queries the current status of a previously started BMC health check.

Field Type Description
task_id string Task ID returned by StartBMCHealthCheck.

ClusterSummary

ClusterSummary aggregates pass/fail counts and latency statistics across the cluster.

Field Type Description
total_nodes int64 Nodes in the topology scope at run start (the full set of entities the task intended to probe). Populated by the aggregator from its input view count, independent of how many entries actually carried a usable NodeReport.
passed int64 none
failed int64 none
endpoint_latency_ms DurationStats none
telemetry_latency_ms DurationStats none
power_get_latency_ms DurationStats Aggregate latency, in milliseconds, for power-limit GETs across all nodes/processors.
power_set_latency_ms DurationStats Aggregate latency, in milliseconds, for power-limit SETs across all nodes/processors.
edpp_get_latency_ms DurationStats Aggregate latency, in milliseconds, for EDPp percent reads across all nodes/processors.
wpps_get_latency_ms DurationStats Aggregate latency, in milliseconds, for WPPS endpoint reads across all nodes.
nodes_attempted int32 Nodes the runner actually invoked a probe against, i.e. entries that reached the screening pass with a non-nil NodeReport. May be smaller than total_nodes if the aggregator dropped malformed entries.
nodes_unreachable int32 Nodes for which probing failed before any sample was collected (BMC unreachable, auth, etc.).

DurationStats

DurationStats captures latency statistics, in milliseconds.

Field Type Description
count int64 none
min_ms int64 none
max_ms int64 none
avg_ms int64 none
p50_ms int64 none
p95_ms int64 none
p99_ms int64 none
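
For readers who want to reproduce these fields from raw latency samples, the sketch below builds a DurationStats-shaped dict. The percentile method (nearest-rank) and the integer truncation of avg_ms are assumptions; the reference does not say which definitions the server uses.

```python
from typing import Dict, List


def percentile(sorted_values: List[int], pct: int) -> int:
    """Nearest-rank percentile over an ascending, non-empty list."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # ceil(pct * n / 100)
    return sorted_values[rank - 1]


def duration_stats(latencies_ms: List[int]) -> Dict[str, int]:
    """Build a DurationStats-shaped dict; all zeros when there are no samples."""
    fields = ("count", "min_ms", "max_ms", "avg_ms", "p50_ms", "p95_ms", "p99_ms")
    if not latencies_ms:
        return dict.fromkeys(fields, 0)
    ordered = sorted(latencies_ms)
    return {
        "count": len(ordered),
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "avg_ms": sum(ordered) // len(ordered),  # integer field, so truncate
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
    }


stats = duration_stats([12, 15, 11, 240, 14])
```

With only five samples, p95_ms and p99_ms both land on the single slow outlier, which is worth keeping in mind when reading tail latencies on small clusters.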

EDPpRead

EDPpRead captures the EDPp reading for a processor.

Field Type Description
processor string none
current_pct int32 none
reference_pct int32 none
reference_available bool none
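
The request's expected_edpp_pct field describes a screen rule that flags processors whose current_pct is below the expected floor (default 100). A minimal sketch of that comparison follows; the dataclass mirrors only two of the fields, and the processor names are made up for the example.

```python
from dataclasses import dataclass
from typing import List

DEFAULT_EXPECTED_EDPP_PCT = 100  # documented default when the request leaves it unset


@dataclass
class EDPpRead:
    processor: str
    current_pct: int


def below_floor(reads: List[EDPpRead], expected_pct: int = DEFAULT_EXPECTED_EDPP_PCT) -> List[str]:
    """Return the processors whose current_pct is below the expected floor."""
    return [r.processor for r in reads if r.current_pct < expected_pct]


reads = [EDPpRead("GPU_0", 100), EDPpRead("GPU_1", 85)]  # hypothetical processors
flagged = below_floor(reads)
```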

EndpointStat

EndpointStat records request statistics for a single Redfish endpoint+method pair.

Field Type Description
endpoint string none
method string none
attempts int64 none
successes int64 none
failures int64 none
duration DurationStats none

FirmwareEntry

FirmwareEntry describes a single firmware inventory entry from the BMC.

The node attribution is denormalized onto every entry (rather than only living on the enclosing NodeReport) so the firmware list is usable when flattened across the cluster – e.g. when correlating firmware versions against failing-node Issues, or when emitting a per-node firmware summary into structured logs without re-walking the parent report.

Field Type Description
id string Inventory ID as reported by the BMC (e.g. “BMC_0”, “HMC_0”, “ERoT_BMC_0”). Stable across probe runs.
version string Version string returned by the BMC. Empty when the inventory item existed but its version field could not be parsed.
node string Node name the firmware was read from. Always set by probeFirmware so consumers can filter the cluster’s firmware list by node without joining back to the parent NodeReport.

IBOOBDrift

IBOOBDrift captures the drift between in-band and out-of-band power values.

Field Type Description
component string none
set_point_w int64 none
oneshot_w int64 none
drift bool none

Issue

Issue describes a single problem detected during the health check.

Field Type Description
severity Severity none
code string none
node string none
resource string none
observed string none
threshold string none
message string none

NodeReport

NodeReport summarizes the BMC pre-requisite health check results for a single node.

Field Type Description
node string none
reachable bool none
error_msg string none
firmware repeated FirmwareEntry none
endpoint_stats repeated EndpointStat none
power_settings repeated PowerSettingRead none
power_writes repeated PowerWriteCycleStat none
ib_oob_drift repeated IBOOBDrift none
edpp repeated EDPpRead none
wpps repeated WPPSRead none
telemetry_samples int64 none
telemetry repeated TelemetryComponentStat none

PowerSettingRead

PowerSettingRead captures the readable power setting state for a component.

Field Type Description
component string none
set_point_w int64 none
allowable_max_w int64 none
default_set_point_w int64 none
power_w double none

PowerWriteCycleStat

PowerWriteCycleStat captures the results of an exercised power write/restore cycle.

Field Type Description
component string none
original_w int64 none
decrement_cycles int32 Number of decrement-by-1W iterations the probe attempted.
mismatches int32 Total readback / set mismatches observed across all decrement and increment iterations plus the final restore readback.
restored bool none
restored_value_w int64 none
restore_error string none
increment_cycles int32 Number of increment-by-1W iterations the probe attempted as part of walking the SetPoint back up to the captured original.

TelemetryComponentStat

TelemetryComponentStat summarizes telemetry samples for a single component.

Field Type Description
component string none
power_w ValueStats none

ValueStats

ValueStats captures statistics over a numeric value (e.g. telemetry power).

Field Type Description
count int64 none
min double none
max double none
mean double none
p50 double none
p95 double none
p99 double none

WPPSRead

WPPSRead captures the workload power profile state for a processor.

Field Type Description
processor string none
accessible bool none
enforced_mask string none
requested_mask string none
supported_mask string none
available_profile_ids repeated int32 none
error_msg string none
was_active bool True when EnforcedProfileMask was non-zero on probe – i.e. one or more workload power profiles were active on this processor at probe time. Lights-on validation requires no active profiles; the screen pass converts this into a WPPS_PROFILES_ACTIVE Issue.
active_profile_ids repeated int32 The profile IDs decoded from the original EnforcedProfileMask, in ascending order. Empty when was_active is false. Captured before any remediation so operators always see exactly what was active at probe time, even if the reset succeeded.
reset_attempted bool True when the probe attempted to disable the active profiles. False when no profiles were active or when --skip-writes was set.
reset_succeeded bool True when the post-reset re-read confirmed EnforcedProfileMask is now zero. Meaningful only when reset_attempted is true.
reset_error string Diagnostic message when reset_attempted is true and reset_succeeded is false. Empty otherwise.
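
The relationship between a mask and active_profile_ids can be sketched as a bit decode. Two assumptions are made here and are not confirmed by this reference: that the mask string parses as an integer literal (for example "0x15"), and that bit position i corresponds to profile ID i.

```python
from typing import List


def decode_profile_mask(mask: str) -> List[int]:
    """Return the set bit positions of a mask string, in ascending order."""
    # Assumption: the mask is an integer literal; base 0 accepts "0x..", "0b..", or decimal.
    value = int(mask, 0) if mask else 0
    # Assumption: bit i set means profile ID i is active.
    return [bit for bit in range(value.bit_length()) if value >> bit & 1]


active_profile_ids = decode_profile_mask("0x15")
was_active = bool(active_profile_ids)
```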

Enums

Severity

Severity classifies the impact level of an Issue raised during the health check.

Name Number Description
SEVERITY_UNSPECIFIED 0 none
SEVERITY_WARNING 1 none
SEVERITY_ERROR 2 none
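
A scripted caller might reduce the Issues list to a single exit code. The sketch below assumes that a higher enum number means a higher impact, which the numbering suggests but the reference does not state; the issue codes are placeholders.

```python
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    SEVERITY_UNSPECIFIED = 0
    SEVERITY_WARNING = 1
    SEVERITY_ERROR = 2


def worst_severity(issues: List[dict]) -> Severity:
    """Return the highest severity present, or UNSPECIFIED when there are no issues."""
    return max(
        (Severity(issue["severity"]) for issue in issues),
        default=Severity.SEVERITY_UNSPECIFIED,
    )


issues = [
    {"severity": 1, "code": "EXAMPLE_WARNING", "node": "node-a"},  # placeholder codes
    {"severity": 2, "code": "EXAMPLE_ERROR", "node": "node-b"},
]
exit_code = 1 if worst_severity(issues) >= Severity.SEVERITY_ERROR else 0
```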

Scalar Value Types

.proto Type C++ Type Java Type Python Type Notes
double double double float none
float float float float none
int32 int32 int int Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead.
int64 int64 long int/long Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead.
uint32 uint32 int int/long Uses variable-length encoding.
uint64 uint64 long int/long Uses variable-length encoding.
sint32 int32 int int Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s.
sint64 int64 long int/long Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s.
fixed32 uint32 int int Always four bytes. More efficient than uint32 if values are often greater than 2^28.
fixed64 uint64 long int/long Always eight bytes. More efficient than uint64 if values are often greater than 2^56.
sfixed32 int32 int int Always four bytes.
sfixed64 int64 long int/long Always eight bytes.
bool bool boolean bool none
string string String str/unicode A string must always contain UTF-8 encoded or 7-bit ASCII text.
bytes string ByteString str May contain any arbitrary sequence of bytes.
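
The note that int32 is inefficient for negative values can be illustrated with a small encoder. This is a hand-written sketch of the protobuf wire format's varint and ZigZag rules, not a call into any protobuf library; the function names are chosen for the example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int32(value: int) -> bytes:
    """int32: negatives are sign-extended to 64 bits before varint encoding."""
    return encode_varint(value & 0xFFFFFFFFFFFFFFFF)


def encode_sint32(value: int) -> bytes:
    """sint32: ZigZag maps small magnitudes to small unsigned values first."""
    zigzag = ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF
    return encode_varint(zigzag)
```

Encoding -1 this way takes ten bytes as an int32 but a single byte as a sint32, which is the trade-off the table's notes describe. The fields in this API are counts, percentages, and millisecond values that are not expected to be negative, so plain int32 and int64 are a natural fit.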