Health Alerts and Overrides

Use this playbook when a host or DPU is blocked by a health alert, or when an operator needs to decide whether an override is safe.

Inspect Health

Aggregate managed-host health:

$ nico-admin-cli managed-host show <host-machine-id>

Per-source health reports:

$ nico-admin-cli machine health-report show <machine-id>

JSON output for scripting:

$ nico-admin-cli -f json machine health-report show <machine-id>
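
One way to use the JSON output in a script is to test whether a report mentions a given classification. The sketch below is illustrative only: the report schema is not described here, so it assumes classification names appear verbatim as quoted strings in the JSON and uses a plain text match instead of field paths. The `has_classification` function and the `NICO_CLI` variable are hypothetical names, not part of the CLI.

```shell
# Hypothetical helper: exit 0 if the machine's JSON health report mentions
# the given classification, 1 if it does not, 2 if the CLI call itself fails.
# Assumption: classification names appear verbatim as quoted JSON strings.
# NICO_CLI is an illustrative override for the CLI binary name.
has_classification() {
  machine_id=$1
  classification=$2
  report=$("${NICO_CLI:-nico-admin-cli}" -f json machine health-report show "$machine_id") || return 2
  # Fixed-string match on the quoted name so partial names do not match.
  printf '%s\n' "$report" | grep -q -F -- "\"$classification\""
}
```

A caller might run `has_classification <machine-id> PreventAllocations` and branch on the exit status. If the real schema turns out to be stable, a field-aware JSON parser would be a safer choice than a text match.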

Health Sources

Health is built from multiple report sources.

| Source | Examples |
| --- | --- |
| Hardware health | BMC sensors, chassis status, leak signals, Redfish state. |
| DPU agent | DPU heartbeat, HBN/BGP state, DHCP relay, network config application. |
| Validation | Machine validation, SKU validation, discovery checks. |
| Rack or infrastructure health | Rack-level inputs when configured. |
| Overrides | Operator or workflow-created health reports. |

Classifications

Classifications determine operational impact.

| Classification | Impact |
| --- | --- |
| PreventAllocations | Host should not receive new work. |
| PreventHostStateChanges | Host should not move through some lifecycle states while the condition is unresolved. |
| SuppressExternalAlerting | Host should be excluded from fleet-health alerting calculations. |
| ExcludeFromStateMachineSla | Host should not count against SLA while intentionally held. |
| StopRebootForAutomaticRecoveryFromStateMachine | NICo should not automatically reboot the host during state-machine recovery. |

Common Probe Areas

| Area | Example probes or symptoms |
| --- | --- |
| Machine validation | Failed DCGM, CUDA sample, SKU validation, inventory mismatch. |
| Site Explorer and BMC | Endpoint exploration failure, Redfish timeout, missing credentials. |
| Hardware sensors | Fan, power, temperature, voltage, leak detection. |
| DPU agent | HeartbeatTimeout, BgpStats, ServiceRunning, DHCP relay/server. |
| InfiniBand | Port down, missing or unexpected P_Keys, cleanup pending. |
| Rack and power | Rack health input, power shelf state, switch state. |

Overrides

Use overrides sparingly and always include an incident or maintenance reason.

Mark a false positive healthy:

$ nico-admin-cli machine health-override add <machine-id> \
    --template mark-healthy \
    --message "false positive INC-123"

Hold a host out of allocation:

$ nico-admin-cli machine health-override add <machine-id> \
    --template out-for-repair \
    --message "INC-123 replacing hardware"

Remove an override:

$ nico-admin-cli machine health-override remove <machine-id> <source-name>
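
The rule about always including an incident or maintenance reason can be enforced in a small wrapper. This is a minimal sketch under stated assumptions: it treats a reason as any message containing `INC-<digits>` or `MAINT-<digits>`. The `MAINT-` prefix, the `add_override` function, and the `NICO_CLI` variable are hypothetical and should be adapted to local ticket conventions.

```shell
# Hypothetical wrapper around "machine health-override add" that refuses to
# run unless the message carries an incident or maintenance reference.
# Assumptions: references look like INC-<digits> or MAINT-<digits>; the
# MAINT- prefix and the NICO_CLI variable are illustrative, not CLI features.
add_override() {
  machine_id=$1
  template=$2
  message=$3
  case "$message" in
    *INC-[0-9]* | *MAINT-[0-9]*) ;;   # reference present, continue
    *)
      echo "refusing override: message needs an incident or maintenance reference" >&2
      return 1
      ;;
  esac
  "${NICO_CLI:-nico-admin-cli}" machine health-override add "$machine_id" \
    --template "$template" \
    --message "$message"
}
```

For example, `add_override <machine-id> out-for-repair "INC-123 replacing hardware"` passes the check and runs the CLI, while a message such as "looks fine" is rejected before any command is issued.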

Guidance

  • Do not override a probe until the owner and impact are understood.
  • Do not use mark-healthy to bypass unknown hardware or DPU failures.
  • Always remove temporary overrides during incident closeout.
  • Prefer maintenance mode when the goal is to suppress SLA noise during investigation.
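
Removing temporary overrides at closeout is easier to get right when it is scripted. The sketch below assumes the operator kept a text file of affected machine IDs, one per line, and knows the source name the override was recorded under. The file format, the `remove_overrides` function, and the `NICO_CLI` variable are hypothetical.

```shell
# Hypothetical closeout helper: remove one override source from every machine
# ID listed in a file (one ID per line) and report any removals that fail.
# Assumptions: the file format, the function name, and the NICO_CLI variable
# are illustrative; source_name is whatever the override was recorded under.
remove_overrides() {
  id_file=$1
  source_name=$2
  failed=0
  # The "|| [ -n ... ]" clause handles a final line with no trailing newline.
  while IFS= read -r machine_id || [ -n "$machine_id" ]; do
    [ -n "$machine_id" ] || continue   # skip blank lines
    # </dev/null keeps the CLI from consuming the rest of the ID file.
    if ! "${NICO_CLI:-nico-admin-cli}" machine health-override remove \
        "$machine_id" "$source_name" </dev/null; then
      echo "failed to remove override on $machine_id" >&2
      failed=1
    fi
  done < "$id_file"
  return "$failed"
}
```

The function keeps going after a failure and returns non-zero if any removal failed, so a closeout checklist can flag hosts that still carry an override.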