
Copyright © 2026, NVIDIA Corporation.

Health Alert Classifications

NVIDIA Infra Controller (NICo) currently uses and recognizes the following set of health alert classifications by convention:

PreventAllocations

Hosts with this classification cannot be used by tenants as instances. An instance creation request that uses the host's Machine ID will fail unless the targeted instance creation feature is used.

PreventHostStateChanges

Hosts with this classification won’t move between certain states during the host’s lifecycle. The classification is mostly used to prevent a host from moving between states while it is uncertain whether all necessary configurations have been applied.

SuppressExternalAlerting

Hosts with this classification are not taken into account when calculating site-wide fleet health. Metrics and alerting queries ignore the number of hosts with this classification when computing 1 - (hosts with alerts / total number of hosts).
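
The calculation above can be sketched in a few lines of Python. This is an illustration only, not NICo's actual metrics query: the Host type, the fleet_health function, and the choice to drop suppressed hosts from both the numerator and the denominator are assumptions made for the example.

```python
from dataclasses import dataclass, field

SUPPRESS = "SuppressExternalAlerting"


@dataclass
class Host:
    """Hypothetical host record; not a NICo type."""
    machine_id: str
    # Each active alert is modelled as the set of classifications it carries.
    alerts: list[set[str]] = field(default_factory=list)


def is_suppressed(host: Host) -> bool:
    """True if any active alert carries SuppressExternalAlerting."""
    return any(SUPPRESS in alert for alert in host.alerts)


def fleet_health(hosts: list[Host]) -> float:
    """Compute 1 - (hosts with alerts / total hosts), ignoring suppressed hosts.

    Assumption: suppressed hosts are removed from both counts.
    Assumption: an empty (or fully suppressed) fleet is reported as 1.0.
    """
    counted = [h for h in hosts if not is_suppressed(h)]
    if not counted:
        return 1.0
    with_alerts = sum(1 for h in counted if h.alerts)
    return 1 - with_alerts / len(counted)


fleet = [
    Host("host-a"),
    Host("host-b", alerts=[{"Hardware"}]),
    Host("host-c", alerts=[{"Hardware", SUPPRESS}]),  # ignored by the query
    Host("host-d"),
]
health = fleet_health(fleet)  # one alerting host out of three counted hosts
```

In this sketch, host-c does not lower the result even though it has a hardware alert, which is the effect the classification is meant to have on external alerting.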

ExcludeFromStateMachineSla

Hosts with this classification are not counted toward the state machine transition-time SLA. This classification is mostly used to prevent the state machine from alerting continuously while manual operations are being performed on the machine.

It is applied automatically (together with PreventAllocations and SuppressExternalAlerting) when a host is placed into maintenance mode via the SetMaintenance RPC, so that stuck-instance and state-machine SLA alerts do not page on-call for hosts an operator is actively working on, regardless of which state or substate the host is in at the time.

StopRebootForAutomaticRecoveryFromStateMachine

For hosts with this classification, the NICo state machine will not automatically execute certain recovery actions (like reboots). The classification can be used to prevent NICo from interacting with hosts while datacenter operators manually perform certain actions.

Hardware

Indicates a hardware-related issue and is used as a broad bucket for hardware/BMC alerts.

SensorWarning

Indicates that a sensor reading violated a caution/warning threshold. In nico-hardware-health, this corresponds to crossing lower_caution/upper_caution thresholds.

SensorCritical

Indicates that a sensor reading violated a critical threshold. In nico-hardware-health, this corresponds to crossing lower_critical/upper_critical thresholds.

SensorFailure

Indicates that a sensor reading is outside the advertised valid range. In nico-hardware-health, this corresponds to values outside range_min/range_max when that range is well-formed.

For BmcSensor alerts, severity is evaluated in this order: SensorFailure -> SensorCritical -> SensorWarning.

Special case for sensor classifications: if thresholds indicate warning/critical/failure but the BMC explicitly reports sensor health as Ok, the probe is treated as success and no alert classification is emitted.
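
The evaluation order and the special case can be sketched as a single function. This is a minimal illustration, not the nico-hardware-health implementation. The threshold field names come from the text above; the classify_sensor function, the dictionary layout, the "Ok" string comparison, the strict comparisons used for "crossing" a threshold, and the reading of "well-formed" as "both bounds present and range_min <= range_max" are all assumptions.

```python
from typing import Optional


def _outside(value: float, lower: Optional[float], upper: Optional[float]) -> bool:
    """True if value is below lower or above upper; missing bounds are skipped."""
    return (lower is not None and value < lower) or (
        upper is not None and value > upper
    )


def classify_sensor(
    reading: float,
    thresholds: dict,
    bmc_health: Optional[str] = None,
) -> Optional[str]:
    """Return a sensor classification, or None if the probe succeeds."""
    # Special case: an explicit Ok from the BMC overrides the thresholds.
    if bmc_health == "Ok":
        return None

    # 1. SensorFailure: outside the advertised valid range, but only when
    #    that range is well-formed (assumed: both bounds set and min <= max).
    rmin = thresholds.get("range_min")
    rmax = thresholds.get("range_max")
    if rmin is not None and rmax is not None and rmin <= rmax:
        if reading < rmin or reading > rmax:
            return "SensorFailure"

    # 2. SensorCritical: beyond lower_critical / upper_critical.
    if _outside(reading, thresholds.get("lower_critical"),
                thresholds.get("upper_critical")):
        return "SensorCritical"

    # 3. SensorWarning: beyond lower_caution / upper_caution.
    if _outside(reading, thresholds.get("lower_caution"),
                thresholds.get("upper_caution")):
        return "SensorWarning"

    return None


# Hypothetical temperature sensor thresholds, for illustration only.
temp = {
    "range_min": 0, "range_max": 120,
    "lower_caution": 10, "upper_caution": 80,
    "lower_critical": 5, "upper_critical": 95,
}
results = [classify_sensor(r, temp) for r in (50, 85, 100, 130)]
```

Checking the most severe condition first means a reading that violates several thresholds at once is reported only under the highest applicable classification.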