NVIDIA Infra Controller (NICo) integrates a variety of tools to continuously assess and report the health of any host under its management. It also allows site operators to configure and extend the set of health checks via runtime configurations and extension APIs.
The health information that is obtained by these tools is rolled up within NICo Core into an “aggregated host health”. The aggregated host health information is used for multiple purposes:
Health checks roughly fall into 3 categories:
The overall health of the system is the combination of all health
reports. If any component reports that a subsystem is not healthy, then the
overall system is not healthy. This combination of health reports is performed
inside nico-core whenever the health status of a host is queried.
A more detailed list of health probes can be found in Health Probe IDs.
A list of health alert classifications can be found in Health Alert Classifications.
The following diagram provides an overview of the current sources of health information within NICo, and how they are rolled up for API users:
NICo components exchange and store aggregated health information internally in a data structure called HealthReport. It contains a set of failed health checks (alerts) as well as a set of succeeded health checks (successes). Each check describes exactly which component was probed (id and target fields).
The data structure is designed and optimized for merging health information from a variety of sources into an aggregate report. E.g. if 2 subsystems report health, and each subsystem reports 1 health alert, the aggregate health report will contain 2 alerts if the alerts are reported by different probe IDs.
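The merge behavior described above can be illustrated with a minimal Python sketch. It is not the NICo implementation: the class layout, the `merge_reports` helper, the source names, and the rule that an alert with the same id and target is kept only once are all assumptions made for illustration. Only the names HealthReport, alerts, successes, id, and target come from the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ProbeAlert:
    id: str                       # which probe failed
    target: Optional[str] = None  # which component was probed
    message: str = ""


@dataclass(frozen=True)
class ProbeSuccess:
    id: str
    target: Optional[str] = None


@dataclass
class HealthReport:
    source: str
    alerts: list = field(default_factory=list)
    successes: list = field(default_factory=list)


def merge_reports(source, reports):
    """Combine several reports into one aggregate report (hypothetical helper)."""
    alerts, successes = {}, {}
    for report in reports:
        for alert in report.alerts:
            # Assumption: the first alert seen for an (id, target) pair wins.
            alerts.setdefault((alert.id, alert.target), alert)
        for success in report.successes:
            successes.setdefault((success.id, success.target), success)
    # Assumption: a probe that alerts anywhere is not listed as a success.
    for key in alerts:
        successes.pop(key, None)
    return HealthReport(source, list(alerts.values()), list(successes.values()))


# Two subsystems, one alert each, with different probe IDs.
hw = HealthReport("hw-health", alerts=[ProbeAlert("SkuValidation")])
mv = HealthReport("validation", alerts=[ProbeAlert("FailedValidationTest")])
aggregate = merge_reports("aggregate", [hw, mv])
print(len(aggregate.alerts))  # 2
```

Keying alerts by id and target is one way to make the merge independent of the order in which sources report.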
A health report is described as follows in gRPC format. In some workflows, health reports are also exposed in other formats, e.g. JSON. These formats follow the same schema.
For failed health checks, the HealthProbeAlert can carry an optional set of classifications that describe how the system will react to the failed health check.
The core idea here is that not all types of alerts have the same significance, and that different alerts require a different response from NICo and site administrators. E.g. a BGP peering failure on just one of the 2 redundant links will not automatically render a host unusable, while a fully unreachable DPU implies that the host can’t be used.
Health alert classifications decouple the NICo logic from the actual alert IDs. E.g. NICo logic does not have to encode an exhaustive check over all possible health probe IDs:
Instead of this, it can just scan whether any of the health alerts in the aggregate host health carries a certain condition:
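The contrast between the two approaches can be sketched in Python. This is an illustration under assumptions, not NICo code: alerts are modeled as plain dictionaries, both function names are hypothetical, and the probe ID `SiteCustomCheck` is invented for the example. The PreventAllocations classification and the other probe IDs appear elsewhere in this document.

```python
PREVENT_ALLOCATIONS = "PreventAllocations"


def blocks_allocation_by_id(alerts):
    """Brittle variant: every blocking probe ID has to be listed by hand."""
    blocking_ids = {
        "FailedValidationTest",
        "FailedValidationTestCompletion",
        "SkuValidation",
    }
    return any(alert["id"] in blocking_ids for alert in alerts)


def blocks_allocation_by_classification(alerts):
    """Decoupled variant: only the classification is inspected."""
    return any(
        PREVENT_ALLOCATIONS in alert.get("classifications", ())
        for alert in alerts
    )


# An alert from a site-provided check with an ID the built-in logic has never seen.
site_alert = {"id": "SiteCustomCheck", "classifications": [PREVENT_ALLOCATIONS]}

print(blocks_allocation_by_id([site_alert]))              # False
print(blocks_allocation_by_classification([site_alert]))  # True
```

The second function keeps working when new probe IDs are introduced, which is the decoupling the classifications are meant to provide.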
This mechanism also allows health checks that site administrators provide via the Health report override APIs to trigger the same behavior as integrated health checks.
The set of classifications that are currently interpreted by NICo is described in List of Health Alert Classifications.
NICo will schedule the execution of validation tests via the scout tool on the actual host at various points
in the lifecycle of a managed host:
The set of tests that is run on a host is defined by the site administrator.
Each test is defined as an arbitrary shell script. A test passes if the script runs and returns an exit code of 0.
The framework thereby allows the execution of off-the-shelf tests, e.g. using the tools dcgm, stress-ng or benchpress.
If Machine Validation fails, a Health Alert with ID FailedValidationTest or FailedValidationTestCompletion will be placed on the host to make the host un-allocatable by tenants.
In addition to that, the full test output (stdout and stderr) will be stored within nico-core and is made available to NICo users via APIs, admin-cli and admin-ui.
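The pass/fail rule and the output capture described above can be sketched as follows. This is a simplified stand-in for what the scout tool does, not its actual code: the `run_validation_test` helper, the `ValidationResult` type, the timeout, and the use of `sh -c` are assumptions for illustration.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class ValidationResult:
    name: str
    passed: bool
    exit_code: int
    stdout: str
    stderr: str


def run_validation_test(name, script, timeout_s=60.0):
    """Run one shell script; exit code 0 counts as a pass (hypothetical helper)."""
    proc = subprocess.run(
        ["sh", "-c", script],
        capture_output=True,  # keep stdout and stderr for later inspection
        text=True,
        timeout=timeout_s,    # raises subprocess.TimeoutExpired if exceeded
    )
    return ValidationResult(
        name, proc.returncode == 0, proc.returncode, proc.stdout, proc.stderr
    )


result = run_validation_test("example", "echo checking; exit 0")
if not result.passed:
    # The alert ID is taken from the text; the dictionary shape is an assumption.
    alert = {"id": "FailedValidationTest", "message": result.stderr}
```

A real test script would invoke a tool such as dcgm or stress-ng in place of the `echo` command.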
Details can be found in the Machine Validation guide.
SKU validation is a feature in NICo which validates that a host contains all the hardware it is expected to contain by validating that it “conforms to a certain SKU”. The SKU is the definition of hardware components within the host. The SKU validation workflow compares it to the set of hardware components that have been detected via NICo hardware discovery workflows, which utilize in-band data as well as out-of-band data.
SKU validation can thereby e.g. detect
SKU validation runs at the same points in the host lifecycle as machine validation tests, and can also be run on-demand while the host is not assigned to any tenant.
If SKU validation fails, a Health Alert with ID SkuValidation will be placed on the host
to make the host un-allocatable by tenants.
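As a rough illustration of the comparison step, the sketch below checks detected component counts against an expected SKU definition. The dictionary representation of a SKU, the component names, and the `sku_mismatches` helper are assumptions; the actual SKU schema is described in the SKU Validation guide.

```python
from collections import Counter


def sku_mismatches(expected, detected):
    """Return a human-readable line per component whose count differs."""
    exp, det = Counter(expected), Counter(detected)
    problems = []
    for component in sorted(set(exp) | set(det)):
        # Counter returns 0 for components missing on either side.
        if exp[component] != det[component]:
            problems.append(
                f"{component}: expected {exp[component]}, detected {det[component]}"
            )
    return problems


expected_sku = {"gpu": 8, "dpu": 1}   # hypothetical SKU definition
discovered = {"gpu": 7, "dpu": 1}     # hypothetical discovery result

problems = sku_mismatches(expected_sku, discovered)
if problems:
    # The alert ID is taken from the text; the dictionary shape is an assumption.
    alert = {"id": "SkuValidation", "message": "; ".join(problems)}
print(problems)  # ['gpu: expected 8, detected 7']
```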
Details can be found in the SKU Validation guide.
The nico-hw-health service periodically queries all Host and DPU BMCs in the system for health information. It emits the captured health datapoints as metrics on a metrics endpoint that can be scraped by a standard telemetry system (prometheus/otel).
Health metrics fetched from BMCs include:
In addition to metrics, nico-hw-health also extracts the values of various event-logs from the BMC and stores them on-disk in order to make them easily accessible for a standard telemetry exporter (e.g. OpenTelemetry Collector based).
Finally, nico-hw-health also emits a health-rollup in HealthReport format towards nico-core that contains an assessed health status of the host based on the extracted metrics.
This assessed health status is built by comparing the metrics that are emitted from BMCs against well-defined
ranges or by interpreting the health_ok values provided by BMCs.
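The two assessment paths (range comparison and health_ok interpretation) might look roughly like the following sketch. The reading format, the range table, the sensor names, and the probe IDs `BmcHealthNotOk` and `MetricOutOfRange` are hypothetical and chosen only to make the example self-contained.

```python
def assess_readings(readings, ranges):
    """Turn BMC readings into alerts (hypothetical helper).

    readings: list of dicts with "name" and optionally "value" and "health_ok".
    ranges:   dict mapping a reading name to an inclusive (low, high) range.
    """
    alerts = []
    for reading in readings:
        name = reading["name"]
        if reading.get("health_ok") is False:
            # The BMC itself reports the component as unhealthy.
            alerts.append({"id": "BmcHealthNotOk", "target": name})
        elif name in ranges and "value" in reading:
            low, high = ranges[name]
            if not low <= reading["value"] <= high:
                alerts.append({"id": "MetricOutOfRange", "target": name})
    return alerts


readings = [
    {"name": "inlet_temp_c", "value": 55.0},
    {"name": "psu0", "health_ok": False},
    {"name": "fan0_rpm", "value": 9000},
]
ranges = {"inlet_temp_c": (5.0, 45.0), "fan0_rpm": (2000, 20000)}

print([a["target"] for a in assess_readings(readings, ranges)])
# ['inlet_temp_c', 'psu0']
```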
For production deployments, nico-hw-health discovers machine, switch, and power-shelf BMC endpoints from NICo API via [endpoint_sources.nico_api]. Machine endpoints carry the inventory metadata needed to interpret hardware health in fleet context, including machine ID, serial number, rack ID, rack placement, and NVLink domain UUID when present. Switch endpoints carry switch ID, serial number, and rack placement when present. Local and test deployments can instead configure explicit machine, switch, or power-shelf identity with [[endpoint_sources.static_bmc_endpoints]]; static machine endpoints can include the same serial number, rack placement, and NVLink domain UUID metadata, static switch endpoints can include serial number and rack placement metadata, and all static endpoints can provide rack_id when rack-level rollups are needed.
The publishing sinks expose that inventory context using the conventions of the target backend:
[sinks.prometheus] adds machine metadata as metric labels named machine_id, serial_number, machine_slot_number, machine_tray_index, and nvlink_domain_uuid; switch metadata uses switch_id, serial_number, switch_slot_number, and switch_tray_index.
[sinks.otlp] adds machine metadata as OTLP resource attributes named machine.id, integer machine.slot_number, integer machine.tray_index, and nvlink.domain.uuid; switch metadata uses switch.id, integer switch.slot_number, and integer switch.tray_index.
[sinks.health_report], [sinks.rack_health_report], [sinks.switch_health_report], and [sinks.power_shelf_health_report] use the same event context when submitting assessed health reports back to NICo API. The persisted HealthReport and HealthProbeAlert schemas remain the probe success/alert model described above.
The Site Explorer process within NICo Core periodically queries all Host and DPU BMCs in order to record certain BMC properties (e.g. components within a host and firmware versions).
In certain conditions the scraping process will place a health alert on the host:
dpu-agent collects health information directly on the DPU and sends a health-rollup towards nico-core. The agent monitors a variety of health conditions, including
Site administrators are able to update the health state of any NICo managed host via
the API calls InsertHealthReportOverride and RemoveHealthReportOverride.
The override API offers 2 different modes of operation:
merge (default) - In this mode, any health probe alerts indicated in the override
will get merged with health probe alerts reported by builtin NICo tools in order
to derive the aggregate host health status. This mode is meant to augment the internal health monitoring mechanism with additional sources of health data.
replace - In this mode, the health probe alerts reported by builtin NICo
monitoring tools will be ignored. Only alerts that are passed as part of the
override will be taken into account. If the override list is empty, the system
will behave as if the host were fully healthy. This mode is meant to bypass the internal health data in case the site operator desires a different behavior.
The API allows multiple merge overrides to be applied to a host's health at the same time by using a different HealthReport::source identifier for each.
This makes it possible to integrate health information from multiple external systems and users without the risk of them overriding each other's data. E.g. health information from an external fleet health monitoring system and from SREs can be stored independently.
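The interaction of the two override modes can be sketched as below. This is a simplified model rather than the NICo implementation: overrides are represented as a dictionary keyed by source, the probe IDs `FleetMonitorAlert` and `SreHold` and the source names are invented, and the choice to still apply merge overrides while a replace override is present is an assumption that the text does not confirm.

```python
def aggregate_alerts(builtin_alerts, overrides):
    """Derive aggregate alerts from builtin alerts plus overrides.

    overrides: dict mapping a source identifier to (mode, alerts),
               where mode is "merge" or "replace".
    """
    has_replace = any(mode == "replace" for mode, _ in overrides.values())
    # In replace mode the builtin alerts are ignored entirely.
    combined = [] if has_replace else list(builtin_alerts)
    for _mode, alerts in overrides.values():
        combined.extend(alerts)
    # Keep one alert per (id, target) pair, preserving order.
    seen, result = set(), []
    for alert in combined:
        key = (alert["id"], alert.get("target"))
        if key not in seen:
            seen.add(key)
            result.append(alert)
    return result


builtin = [{"id": "SkuValidation"}]

# Two independent merge overrides, stored under different source identifiers.
merged = aggregate_alerts(builtin, {
    "fleet-monitor": ("merge", [{"id": "FleetMonitorAlert"}]),
    "sre": ("merge", [{"id": "SreHold"}]),
})
print([a["id"] for a in merged])  # ['SkuValidation', 'FleetMonitorAlert', 'SreHold']

# An empty replace override makes the host appear fully healthy.
print(aggregate_alerts(builtin, {"sre": ("replace", [])}))  # []
```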
If a ManagedHost’s health is overridden, the remaining behavior is exactly the same as if the overridden Health report had been directly derived from monitoring hardware health:
PreventAllocations classification is present in the aggregate host health