Health Alerts and Overrides
Use this playbook when a host or DPU is blocked by health, or when an operator needs to understand whether an override is safe.
Inspect Health
Aggregate managed-host health:
Per-source health reports:
JSON output for scripting:
Health Sources
Health is built from multiple report sources.
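As a rough illustration of how several report sources could roll up into one aggregate view, the sketch below merges per-source reports and treats the host as healthy only when no source raises an alert. The `HealthReport` type, its fields, and the source and probe names are all hypothetical and do not reflect the actual report schema.

```python
from dataclasses import dataclass, field


@dataclass
class HealthReport:
    """One report from a single source (hypothetical shape, not the real schema)."""
    source: str
    alerts: list = field(default_factory=list)  # names of probes currently alerting


def aggregate(reports):
    """Merge per-source reports into a single aggregate view.

    The aggregate is healthy only if no source reports any alert.
    """
    alerts_by_source = {}
    for report in reports:
        if report.alerts:
            alerts_by_source[report.source] = list(report.alerts)
    return {"healthy": not alerts_by_source, "alerts_by_source": alerts_by_source}


if __name__ == "__main__":
    reports = [
        HealthReport("source-a"),             # placeholder source names
        HealthReport("source-b", ["probe-x"]),
    ]
    print(aggregate(reports))
```

Keeping the alerts grouped by source makes it easier to find the owner of a probe before deciding on an override.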
Classifications
Classifications determine operational impact.
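One way to picture this is that each alert carries a set of classification labels, and operational decisions look only at those labels rather than at the probe name. The sketch below is a minimal example under that assumption. The `blocks-allocation` label, the `Alert` type, and the probe names are invented for illustration and are not the system's real classification names.

```python
from dataclasses import dataclass

# Hypothetical classification label; the real names may differ.
BLOCKS_ALLOCATION = "blocks-allocation"


@dataclass(frozen=True)
class Alert:
    """An alert tagged with classification labels (hypothetical shape)."""
    probe: str
    classifications: frozenset


def is_allocatable(alerts):
    """Return True when no active alert carries the allocation-blocking label."""
    return not any(BLOCKS_ALLOCATION in alert.classifications for alert in alerts)


if __name__ == "__main__":
    informational = Alert("probe-a", frozenset())
    blocking = Alert("probe-b", frozenset({BLOCKS_ALLOCATION}))
    print(is_allocatable([informational]))            # no blocking label present
    print(is_allocatable([informational, blocking]))  # one alert blocks allocation
```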
Common Probe Areas
Overrides
Use overrides sparingly and always include an incident or maintenance reason.
Mark a false positive healthy:
Hold a host out of allocation:
Remove an override:
Guidance
- Do not override a probe until the owner and impact are understood.
- Do not use mark-healthy to bypass unknown hardware or DPU failures.
- Always remove temporary overrides during incident closeout.
- Prefer maintenance mode when the goal is to suppress SLA noise during investigation.
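The guidance above could be backed by a small closeout check. The sketch below assumes overrides are tracked as simple records. The `Override` type, its fields, the `hold` kind, and the host and incident identifiers are hypothetical, and only `mark-healthy` comes from this playbook. It rejects overrides that lack a reason and lists temporary overrides that still need to be removed.

```python
from dataclasses import dataclass


@dataclass
class Override:
    """A tracked health override (hypothetical record shape)."""
    host_id: str
    kind: str        # e.g. "mark-healthy", or a hypothetical "hold"
    reason: str      # incident or maintenance reference
    temporary: bool


def validate(override):
    """Reject overrides that do not carry an incident or maintenance reason."""
    if not override.reason.strip():
        raise ValueError(
            f"override on {override.host_id} needs an incident or maintenance reason"
        )


def closeout_blockers(overrides):
    """Return temporary overrides that must be removed before incident closeout."""
    return [o for o in overrides if o.temporary]


if __name__ == "__main__":
    overrides = [
        Override("host-1", "mark-healthy", "INC-0000 false positive", temporary=True),
        Override("host-2", "hold", "planned maintenance", temporary=False),
    ]
    for o in overrides:
        validate(o)
    for o in closeout_blockers(overrides):
        print(f"remove before closeout: {o.host_id} ({o.kind}): {o.reason}")
```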