bmc_health

API Reference: v1/bmc_health.proto

BMC health check service messages.

Defines requests and responses used by the deep BMC pre-requisite health check, which validates Redfish endpoint reachability, firmware inventory, power setting controls, IB/OOB drift, EDPp/WPPS state, and telemetry across cluster nodes.


Messages

BMCHealthCheckHandle

BMCHealthCheckHandle is returned immediately from StartBMCHealthCheck. The task continues running in the background on the server; callers poll GetBMCHealthCheckStatus and GetBMCHealthCheckReport using task_id to observe progress and retrieve the final report.

task_id is stable for the lifetime of the task in the server’s task manager. After the task completes it remains queryable for at most --zombie-lifetime (default 10m), after which a status or report lookup returns NotFound. A server restart drops all task state, so callers must re-run on restart.

Field	Type	Description
task_id	string	Unique task identifier; pass back to status / report RPCs.
name	string	Human-readable task name (e.g. “bmc_health_check”).
status	string	Initial status string from the task manager (e.g. “Started”).
started_at	google.protobuf.Timestamp	Server-side timestamp when the task was started.
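
The start-then-poll flow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the service's client: `client` stands for any object that exposes the three RPCs as methods returning plain dicts, and `TaskNotFound`, `run_health_check`, and `FakeClient` are hypothetical names introduced only for illustration. Generated gRPC stubs will look different.

```python
import time


class TaskNotFound(Exception):
    """Hypothetical stand-in for a NotFound status returned by the server."""


def run_health_check(client, topology_name, poll_interval_s=5.0, sleep=time.sleep):
    """Start a check, poll its status until done, then fetch the report."""
    handle = client.start_bmc_health_check({"topology_name": topology_name})
    task_id = handle["task_id"]
    while True:
        try:
            status = client.get_bmc_health_check_status({"task_id": task_id})
        except TaskNotFound:
            # The task aged out of the zombie window or the server restarted;
            # its state is gone, so the check has to be started again.
            raise RuntimeError(f"task {task_id} is no longer known; re-run the check")
        if status["done"]:
            break
        sleep(poll_interval_s)
    return client.get_bmc_health_check_report({"task_id": task_id})


class FakeClient:
    """In-memory fake (an assumption) used only to exercise the loop."""

    def __init__(self, polls_until_done):
        self.remaining = polls_until_done

    def start_bmc_health_check(self, request):
        return {"task_id": "task-1", "name": "bmc_health_check", "status": "Started"}

    def get_bmc_health_check_status(self, request):
        self.remaining -= 1
        done = self.remaining <= 0
        return {"task_id": request["task_id"], "done": done,
                "status": "Done" if done else "Started"}

    def get_bmc_health_check_report(self, request):
        return {"done": True, "status": "Done", "report": {}, "error_message": ""}
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a caller would normally leave it at `time.sleep`.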

BMCHealthCheckProgress

BMCHealthCheckProgress reports how far the runner has advanced through the in-scope entities. Both counters reset to zero between runs; they are not durable.

Field	Type	Description
nodes_total	int32	Total entities to probe; populated once the task has loaded its entity set from the topology store (zero until then).
nodes_completed	int32	Entities that have produced a response (real, synthesized timeout, or dispatch error).
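
Because nodes_total is zero until the entity set is loaded, a progress display needs to avoid dividing by zero. A minimal sketch, assuming the two counters have already been read out of the message as plain integers (the helper names are hypothetical):

```python
def progress_fraction(nodes_total, nodes_completed):
    """Return completion in [0.0, 1.0], or None while the entity set is unloaded."""
    if nodes_total <= 0:
        # nodes_total stays zero until the task has loaded its entity set.
        return None
    return min(nodes_completed, nodes_total) / nodes_total


def format_progress(nodes_total, nodes_completed):
    """Render the counters as a short human-readable string."""
    fraction = progress_fraction(nodes_total, nodes_completed)
    if fraction is None:
        return "loading entity set"
    return f"{nodes_completed}/{nodes_total} ({fraction:.0%})"
```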

BMCHealthCheckReportRequest

BMCHealthCheckReportRequest fetches the final report. If the task is still running this returns done=false with no payload; callers should continue polling (or use the dpsctl --wait shortcut).

Field	Type	Description
task_id	string	Task ID returned by StartBMCHealthCheck.
summary_only	bool	When true, the server strips per-node NodeReport entries from the response and returns only Status, ClusterSummary, and Issues. Useful on large clusters where the full per-node detail (firmware lists, endpoint stats, telemetry samples) inflates the payload past what an operator wants to scroll through, and for scripted polling where only the pass/fail summary matters. The underlying task is unaffected; a subsequent call with summary_only=false returns the full report (until the task exits the zombie window).

BMCHealthCheckReportResponse

BMCHealthCheckReportResponse is the response of GetBMCHealthCheckReport.

Field	Type	Description
done	bool	True when the underlying task is finished (success or failure). When false the report field is unset; poll again later.
status	string	Task manager status string at the time of the lookup (“Started” while running, “Done” / “Failed” once finished).
report	BMCHealthCheckResponse	Cluster-level result. Populated only when done=true and the task did not fail before any data could be collected.
error_message	string	Human-readable error string from the task’s terminal error, if any. Empty on a clean run.
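
The combinations of done, report, and error_message described in the table can be told apart as in the sketch below. It assumes the response has been converted to a plain dict in which an unset report is `None`; the function name and the three outcome labels are hypothetical.

```python
def classify_report_response(resp):
    """Map a report response dict to 'pending', 'report', or 'failed_no_data'."""
    if not resp.get("done", False):
        # The task is still running; the report field is unset. Poll again later.
        return "pending"
    if resp.get("report") is not None:
        # A report is present. error_message may still be non-empty if the
        # task failed after some data had been collected.
        return "report"
    # The task finished but failed before any data could be collected.
    return "failed_no_data"
```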

BMCHealthCheckRequest

BMCHealthCheckRequest is the request for running a deep BMC pre-requisite health check.

Concurrency note: the server enforces an at-most-one gate that is SERVER-WIDE, not topology- or node-scoped. While any BMC health check is in flight, a subsequent StartBMCHealthCheck for a different topology (or a disjoint Nodes subset) is rejected with AlreadyExists and the existing task_id. This is the intended first-version behavior because in practice topologies on the same server have a high node intersection and serializing protects shared BMCs from overlapping load and write-cycle interleaving. A future revision may loosen this to allow disjoint runs to proceed in parallel.
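
The gate semantics can be modelled with a single shared slot. The sketch below is an illustration of the behavior described above, not the server's implementation; `ServerWideGate` and the `AlreadyExists` exception class are hypothetical names, and how the existing task_id is carried on the real error is not specified here.

```python
import threading


class AlreadyExists(Exception):
    """Hypothetical error carrying the task_id of the in-flight check."""

    def __init__(self, task_id):
        super().__init__(f"health check already running: {task_id}")
        self.task_id = task_id


class ServerWideGate:
    """At-most-one gate: one slot shared by every topology and node subset."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running_task_id = None

    def acquire(self, task_id):
        """Claim the slot, or raise AlreadyExists naming the current holder."""
        with self._lock:
            if self._running_task_id is not None:
                raise AlreadyExists(self._running_task_id)
            self._running_task_id = task_id

    def release(self, task_id):
        """Free the slot, but only if task_id is the current holder."""
        with self._lock:
            if self._running_task_id == task_id:
                self._running_task_id = None
```

Note that `acquire` takes no topology or node argument: the scope of the request plays no part in the decision, which is what makes the gate server-wide.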

Field	Type	Description
topology_name	string	Name of the topology whose nodes are to be checked.
nodes	repeated string	Optional explicit list of node names to check. If empty, all nodes in the topology are checked.
samples_per_telemetry	optional int32	Optional number of telemetry samples to collect per node.
concurrency	optional int32	Optional maximum number of nodes to check concurrently.
skip_writes	optional bool	Optional flag to skip write/restore operations on BMCs.
expected_edpp_pct	optional int32	Expected steady-state EDPp current value, in percent. The screen rule flags a processor whose EDPp current_pct is below this floor. Default (when unset) is 100. Allowed range when set: 1..200.
telemetry_interval_ms	optional int32	Sleep between telemetry samples on each node, in milliseconds. Zero means no inter-sample delay (samples issued back-to-back). Allowed range when set: 0..60_000 (1 minute). The dpsctl CLI defaults to 500ms; direct API callers that omit the field get the server-side default (no delay, matching the legacy back-to-back semantics).
write_resolution_timeout_ms	optional int32	Per-PATCH readback window for the write-cycle probe, in milliseconds. After each PATCH on PowerLimitWatts.SetPoint the probe polls the readback on a small backoff schedule until the value matches or this ceiling elapses. Zero (or unset) uses the bmcprobe default (1s). Operators on very slow BMC fleets can raise this, but values much above 5s tend to amplify failures on already-stuck BMCs rather than recover them. Allowed range when set: 0..60_000 (60s).
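
A client can check the documented ranges before sending the request. This is a minimal sketch that assumes the request is held as a plain dict with unset optional fields absent or `None`; the helper names are hypothetical and the server remains the authority on validation.

```python
# Allowed ranges when the optional field is set, as listed in the table above.
RANGES = {
    "expected_edpp_pct": (1, 200),
    "telemetry_interval_ms": (0, 60_000),
    "write_resolution_timeout_ms": (0, 60_000),
}


def validate_request(req):
    """Return a list of range violations for the range-checked optional fields."""
    problems = []
    for field, (low, high) in RANGES.items():
        value = req.get(field)
        if value is None:
            continue  # unset optional field: the server-side default applies
        if not low <= value <= high:
            problems.append(f"{field}={value} outside {low}..{high}")
    return problems


def effective_edpp_floor(req):
    """expected_edpp_pct defaults to 100 when unset."""
    value = req.get("expected_edpp_pct")
    return 100 if value is None else value
```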

BMCHealthCheckResponse

BMCHealthCheckResponse contains the results of a BMC pre-requisite health check.

Field	Type	Description
status	Status	none
node_reports	repeated NodeReport	none
cluster_summary	ClusterSummary	none
issues	repeated Issue	none

BMCHealthCheckStatus

BMCHealthCheckStatus is the response of GetBMCHealthCheckStatus.

Field	Type	Description
task_id	string	Echoes the looked-up task_id.
name	string	Task name (constant for the lifetime of the task).
status	string	Task manager status string: “Init”, “Started”, “Done”, “Failed”.
started_at	google.protobuf.Timestamp	Time the task was started.
completed_at	google.protobuf.Timestamp	Time the task completed; unset while running.
diag_message	string	Diagnostic message when the task panicked. Empty otherwise.
progress	BMCHealthCheckProgress	Coarse progress counters. Available even before completion.
done	bool	True when the task has reached a terminal state (success, failure, or cancellation) and the report is ready to fetch via GetBMCHealthCheckReport.
error_message	string	Human-readable error string from the task’s terminal error, if any (e.g. a loadData failure). Empty when no error was recorded.

BMCHealthCheckStatusRequest

BMCHealthCheckStatusRequest queries the current status of a previously started BMC health check.

Field	Type	Description
task_id	string	Task ID returned by StartBMCHealthCheck.

ClusterSummary

ClusterSummary aggregates pass/fail counts and latency statistics across the cluster.

Field	Type	Description
total_nodes	int64	Nodes in the topology scope at run start (the full set of entities the task intended to probe). Populated by the aggregator from its input view count, independent of how many entries actually carried a usable NodeReport.
passed	int64	none
failed	int64	none
endpoint_latency_ms	DurationStats	none
telemetry_latency_ms	DurationStats	none
power_get_latency_ms	DurationStats	Aggregate latency, in milliseconds, for power-limit GETs across all nodes/processors.
power_set_latency_ms	DurationStats	Aggregate latency, in milliseconds, for power-limit SETs across all nodes/processors.
edpp_get_latency_ms	DurationStats	Aggregate latency, in milliseconds, for EDPp percent reads across all nodes/processors.
wpps_get_latency_ms	DurationStats	Aggregate latency, in milliseconds, for WPPS endpoint reads across all nodes.
nodes_attempted	int32	Nodes the runner actually invoked a probe against, i.e. entries that reached the screening pass with a non-nil NodeReport. May be smaller than total_nodes if the aggregator dropped malformed entries.
nodes_unreachable	int32	Nodes for which probing failed before any sample was collected (BMC unreachable, auth, etc.).
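
The gap between total_nodes and nodes_attempted is the number of entries the aggregator dropped. A minimal sketch, assuming the summary has been converted to a plain dict (the function name is hypothetical):

```python
def dropped_entries(summary):
    """Entries counted in total_nodes that never reached the screening pass."""
    # nodes_attempted may be smaller than total_nodes when the aggregator
    # dropped malformed entries; clamp at zero to stay defensive.
    return max(summary["total_nodes"] - summary["nodes_attempted"], 0)
```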

DurationStats

DurationStats captures latency statistics, in milliseconds.

Field	Type	Description
count	int64	none
min_ms	int64	none
max_ms	int64	none
avg_ms	int64	none
p50_ms	int64	none
p95_ms	int64	none
p99_ms	int64	none
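
To make the fields concrete, the sketch below derives the same set of statistics from a list of latency samples. The nearest-rank percentile method, the integer (floor) average, and the all-zero result for an empty input are assumptions made for illustration; the service's actual method is not documented here.

```python
import math


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile over a non-empty, ascending list."""
    rank = max(math.ceil(pct / 100 * len(sorted_values)), 1)
    return sorted_values[rank - 1]


def duration_stats(samples_ms):
    """Compute DurationStats-shaped statistics from integer millisecond samples."""
    fields = ("count", "min_ms", "max_ms", "avg_ms", "p50_ms", "p95_ms", "p99_ms")
    if not samples_ms:
        return dict.fromkeys(fields, 0)
    ordered = sorted(samples_ms)
    return {
        "count": len(ordered),
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "avg_ms": sum(ordered) // len(ordered),  # floor average (assumption)
        "p50_ms": nearest_rank(ordered, 50),
        "p95_ms": nearest_rank(ordered, 95),
        "p99_ms": nearest_rank(ordered, 99),
    }
```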

EDPpRead

EDPpRead captures the EDPp reading for a processor.

Field	Type	Description
processor	string	none
current_pct	int32	none
reference_pct	int32	none
reference_available	bool	none
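
The screen rule documented on BMCHealthCheckRequest.expected_edpp_pct (flag a processor whose current_pct is below the floor, default 100) can be applied to a list of readings as sketched below. The readings are assumed to be plain dicts and the function name is hypothetical.

```python
def edpp_below_floor(reads, expected_edpp_pct=None):
    """Return the processors whose current_pct is below the expected floor."""
    # The floor defaults to 100 when expected_edpp_pct is unset.
    floor = 100 if expected_edpp_pct is None else expected_edpp_pct
    return [read["processor"] for read in reads if read["current_pct"] < floor]
```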

EndpointStat

EndpointStat records request statistics for a single Redfish endpoint+method pair.

Field	Type	Description
endpoint	string	none
method	string	none
attempts	int64	none
successes	int64	none
failures	int64	none
duration	DurationStats	none

FirmwareEntry

FirmwareEntry describes a single firmware inventory entry from the BMC.

The node attribution is denormalized onto every entry (rather than only living on the enclosing NodeReport) so the firmware list is usable when flattened across the cluster – e.g. when correlating firmware versions against failing-node Issues, or when emitting a per-node firmware summary into structured logs without re-walking the parent report.

Field	Type	Description
id	string	Inventory ID as reported by the BMC (e.g. “BMC_0”, “HMC_0”, “ERoT_BMC_0”). Stable across probe runs.
version	string	Version string returned by the BMC. Empty when the inventory item existed but its version field could not be parsed.
node	string	Node name the firmware was read from. Always set by probeFirmware so consumers can filter the cluster’s firmware list by node without joining back to the parent NodeReport.
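
Because every entry carries its node name, a flattened cluster-wide list can be regrouped without the parent NodeReport. A minimal sketch, assuming entries are plain dicts with the three fields above (the helper names are hypothetical):

```python
from collections import defaultdict


def firmware_by_node(entries):
    """Group a flattened firmware list into {node: {id: version}}."""
    grouped = defaultdict(dict)
    for entry in entries:
        grouped[entry["node"]][entry["id"]] = entry["version"]
    return dict(grouped)


def versions_in_use(entries, firmware_id):
    """Map each distinct version of one inventory ID to the nodes reporting it."""
    nodes = defaultdict(list)
    for entry in entries:
        if entry["id"] == firmware_id:
            nodes[entry["version"]].append(entry["node"])
    return {version: sorted(names) for version, names in nodes.items()}
```

`versions_in_use` is the kind of view that makes firmware skew visible at a glance: more than one key for the same inventory ID means the cluster is running mixed versions.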

IBOOBDrift

IBOOBDrift captures the drift between in-band and out-of-band power values.

Field	Type	Description
component	string	none
set_point_w	int64	none
oneshot_w	int64	none
drift	bool	none

Issue

Issue describes a single problem detected during the health check.

Field	Type	Description
severity	Severity	none
code	string	none
node	string	none
resource	string	none
observed	string	none
threshold	string	none
message	string	none

NodeReport

NodeReport summarizes the BMC pre-requisite health check results for a single node.

Field	Type	Description
node	string	none
reachable	bool	none
error_msg	string	none
firmware	repeated FirmwareEntry	none
endpoint_stats	repeated EndpointStat	none
power_settings	repeated PowerSettingRead	none
power_writes	repeated PowerWriteCycleStat	none
ib_oob_drift	repeated IBOOBDrift	none
edpp	repeated EDPpRead	none
wpps	repeated WPPSRead	none
telemetry_samples	int64	none
telemetry	repeated TelemetryComponentStat	none

PowerSettingRead

PowerSettingRead captures the readable power setting state for a component.

Field	Type	Description
component	string	none
set_point_w	int64	none
allowable_max_w	int64	none
default_set_point_w	int64	none
power_w	double	none

PowerWriteCycleStat

PowerWriteCycleStat captures the results of an exercised power write/restore cycle.

Field	Type	Description
component	string	none
original_w	int64	none
decrement_cycles	int32	Number of decrement-by-1W iterations the probe attempted.
mismatches	int32	Total readback / set mismatches observed across all decrement and increment iterations plus the final restore readback.
restored	bool	none
restored_value_w	int64	none
restore_error	string	none
increment_cycles	int32	Number of increment-by-1W iterations the probe attempted as part of walking the SetPoint back up to the captured original.
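
A consumer may want a single yes/no answer for each exercised component. The sketch below treats a cycle as clean when no mismatches were seen and the set point was restored to the captured original. That definition is an assumption made for illustration, not the service's screen rule, and the function name is hypothetical.

```python
def write_cycle_clean(stat):
    """True when the write/restore cycle left the component as it was found."""
    return bool(
        stat["mismatches"] == 0
        and stat["restored"]
        and stat["restored_value_w"] == stat["original_w"]
        and not stat["restore_error"]
    )
```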

TelemetryComponentStat

TelemetryComponentStat summarizes telemetry samples for a single component.

Field	Type	Description
component	string	none
power_w	ValueStats	none

ValueStats

ValueStats captures statistics over a numeric value (e.g. telemetry power).

Field	Type	Description
count	int64	none
min	double	none
max	double	none
mean	double	none
p50	double	none
p95	double	none
p99	double	none

WPPSRead

WPPSRead captures the workload power profile state for a processor.

Field	Type	Description
processor	string	none
accessible	bool	none
enforced_mask	string	none
requested_mask	string	none
supported_mask	string	none
available_profile_ids	repeated int32	none
error_msg	string	none
was_active	bool	True when EnforcedProfileMask was non-zero on probe – i.e. one or more workload power profiles were active on this processor at probe time. Lights-on validation requires no active profiles; the screen pass converts this into a WPPS_PROFILES_ACTIVE Issue.
active_profile_ids	repeated int32	The profile IDs decoded from the original EnforcedProfileMask, in ascending order. Empty when was_active is false. Captured before any remediation so operators always see exactly what was active at probe time, even if the reset succeeded.
reset_attempted	bool	True when the probe attempted to disable the active profiles. False when no profiles were active or when --skip-writes was set.
reset_succeeded	bool	True when the post-reset re-read confirmed EnforcedProfileMask is now zero. Meaningful only when reset_attempted is true.
reset_error	string	Diagnostic message when reset_attempted is true and reset_succeeded is false. Empty otherwise.
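
The was_active, reset_attempted, and reset_succeeded flags combine into four distinct states. A minimal sketch of that decision, assuming the reading is a plain dict; the function name and the state labels are hypothetical.

```python
def wpps_reset_outcome(read):
    """Classify the remediation state of a single WPPSRead."""
    if not read["was_active"]:
        return "no_active_profiles"
    if not read["reset_attempted"]:
        # Profiles were active but the probe did not try to disable them,
        # for example because writes were skipped.
        return "active_not_reset"
    # reset_succeeded is meaningful only once reset_attempted is true.
    if read["reset_succeeded"]:
        return "reset_ok"
    return "reset_failed"
```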

Enums

Severity

Severity classifies the impact level of an Issue raised during the health check.

Name	Number	Description
SEVERITY_UNSPECIFIED	0	none
SEVERITY_WARNING	1	none
SEVERITY_ERROR	2	none
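
For scripted consumers, the enum can be mirrored locally and used to tally Issues. This is a sketch that assumes each Issue is a plain dict whose severity holds the numeric enum value; generated protobuf code provides its own enum type, and `count_by_severity` is a hypothetical helper.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Local mirror of the Severity enum values listed above."""
    SEVERITY_UNSPECIFIED = 0
    SEVERITY_WARNING = 1
    SEVERITY_ERROR = 2


def count_by_severity(issues):
    """Tally issues per severity name, including severities with no issues."""
    counts = {member.name: 0 for member in Severity}
    for issue in issues:
        counts[Severity(issue["severity"]).name] += 1
    return counts
```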

Scalar Value Types

.proto Type	Notes	C++ Type	Java Type	Python Type
double		double	double	float
float		float	float	float
int32	Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead.	int32	int	int
int64	Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead.	int64	long	int/long
uint32	Uses variable-length encoding.	uint32	int	int/long
uint64	Uses variable-length encoding.	uint64	long	int/long
sint32	Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s.	int32	int	int
sint64	Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s.	int64	long	int/long
fixed32	Always four bytes. More efficient than uint32 if values are often greater than 2^28.	uint32	int	int
fixed64	Always eight bytes. More efficient than uint64 if values are often greater than 2^56.	uint64	long	int/long
sfixed32	Always four bytes.	int32	int	int
sfixed64	Always eight bytes.	int64	long	int/long
bool		bool	boolean	bool
string	A string must always contain UTF-8 encoded or 7-bit ASCII text.	string	String	str/unicode
bytes	May contain any arbitrary sequence of bytes.	string	ByteString	str