
# Break-Fix Architecture

The mission of the NCP Software Reference architecture break-fix system
is to ensure that tenants have high availability on clusters through
automations that:

* Ensure healthy cluster hand-off from cloud provider to tenant
* Passively screen, triage, and remediate hardware (field diags→RMA) and
  software components (applying remediations such as firmware updates,
  software upgrades, and system reboots)
* Minimize human intervention required to fix broken nodes

The goal is to create a day 2 Break-Fix system centered on monitoring
and, to the greatest extent possible, automating the remediation of the
GPU-based compute nodes. Day 2 begins once the data center is online,
validated, and ready for normal tenant operation. Minimizing downtime
is critical.

## Break-Fix Overview

The intent of the Break-Fix architecture is to "detect, triage,
remediate, and validate" GPU nodes for AI infrastructure. This is
sometimes referred to as health checks and break-fix, but this document
treats break-fix as inclusive of the health checks. This section
describes how to create a robust Break-Fix architecture that can be
operationalized across a wide array of AI platforms and workloads and
operated by an NCP.

While the architecture can be expanded to support traditional compute
nodes, the focus of this document is on GPU nodes. By automating
break-fix scenarios, especially those where the entire fix can be
applied without human intervention, the resulting uptime improves
significantly.

## Break-Fix

The Break-Fix architecture should be thought of as having two primary
domains. The first, the **Tenant Domain (TD)**, includes all break-fix
operations performed while the node or GPU is in a tenant's possession.
The second, the **Infrastructure Operator Domain (IOD)**, includes all
break-fix operations performed while the machine is under the
infrastructure operator's control. The notable exception to this split
is that continuous/offline checks are performed by the IOD even when a
machine is part of the Tenant Domain. It is important to keep these two
domains logically separated.

The Tenant Domain has three primary roles:

* **Continuous In-Band Node Health Checks**: This requires the tenant to
  run a set of health checks on a regular heartbeat to inform the
  break-fix system of the overall health of each node. The tenant must
  therefore opt in to running continuous health checks and participating
  in the overall break-fix flow.
* **Tenant Remediation**: Where possible, the goal of the break-fix
  system is to fix any unhealthy node as quickly as possible while
  minimizing impact on the tenant. Many issues can be handled without
  removing the node from circulation, such as performing a PCIe GPU
  reset (sketched after this list).
* **Node Return**: If a node's health checks identify an issue that
  cannot be remediated while in the tenant's possession, the tenant must
  agree to return the node to the IOD for further action. This could
  result in another node being allocated, or in a reduced set of
  capabilities for the tenant.
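
Where platform policy permits, a tenant-side agent can attempt an
in-place fix before surrendering the node. Below is a minimal sketch of
that flow in Python, assuming `nvidia-smi` is available on the node; the
control-plane URL and its return endpoint are hypothetical placeholders,
not part of the reference architecture.

```python
# Tenant-side remediation sketch: try an in-place GPU reset, and fall
# back to requesting a node return if the reset fails.
import subprocess

import requests

CONTROL_PLANE = "https://breakfix.example.com"  # hypothetical endpoint


def reset_gpu(gpu_index: int) -> bool:
    """Attempt an in-place GPU reset; requires root and an idle GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--gpu-reset", "-i", str(gpu_index)],
        capture_output=True, text=True,
    )
    return result.returncode == 0


def remediate_or_return(node_id: str, gpu_index: int) -> None:
    if reset_gpu(gpu_index):
        print(f"GPU {gpu_index} reset in place; node stays in circulation")
        return
    # The issue cannot be fixed in the Tenant Domain: request that the
    # node be returned to the IOD for deeper triage.
    requests.post(
        f"{CONTROL_PLANE}/nodes/{node_id}/return",
        json={"reason": "gpu_reset_failed", "gpu": gpu_index},
        timeout=10,
    )
```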

The Infrastructure Operator Domain (IOD) has three primary roles:

* **Initial Node Validation**: A key principle is that the IaaS service
  always delivers a known-good node to the tenant upon initial vending.
  Thus, while a node is under the IOD, the infrastructure operator needs
  to run node health checks and verify that the node is functioning as
  expected (to the extent the tests can show).
* **Continuous Offline Node Health Checks**: Some health checks can be
  done in an offline manner, so they do not require the tenant to opt in.
  Examples of offline checks against the platform or a data lake include
  BMC queries for node thermals, DPU heartbeats, and log file monitoring
  for indications of issues (a BMC query is sketched after this list). As
  mentioned, these checks are performed by the IOD even when the machine
  is in the Tenant Domain.
* **Node Remediation**: When a node issue is identified by any
  mechanism, process the node (in as automated a way as possible):
  remove it from circulation, get it healthy, and put it back into
  circulation.
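
As one illustration of an offline check, the sketch below queries a
BMC's standard Redfish `Thermal` resource for out-of-range temperature
sensors. The chassis ID (`1`), credentials, and the 85 C threshold are
illustrative assumptions; real values are platform specific.

```python
# Offline thermal check sketch using the BMC's Redfish API, runnable
# by the IOD without any tenant involvement.
import requests


def check_node_thermals(bmc_host: str, user: str, password: str,
                        limit_c: float = 85.0) -> list[str]:
    """Return the names of temperature sensors reading above the limit."""
    resp = requests.get(
        f"https://{bmc_host}/redfish/v1/Chassis/1/Thermal",
        auth=(user, password),
        verify=False,  # many BMCs use self-signed certificates
        timeout=10,
    )
    resp.raise_for_status()
    hot = []
    for sensor in resp.json().get("Temperatures", []):
        reading = sensor.get("ReadingCelsius")
        if reading is not None and reading > limit_c:
            hot.append(f"{sensor.get('Name', 'unknown')}: {reading} C")
    return hot
```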

### Node-Level Health Checks

There are many utilities to help with node-level health checks, both
from NVIDIA and from the broader ecosystem. The NCP Software Reference
Guide architecture assumes no specific tool integration, but it does
assume that NVIDIA's Data Center GPU Manager (DCGM) is in use.
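
As a minimal sketch of how a harness might invoke DCGM, the following
wraps the `dcgmi diag` CLI, assuming DCGM is installed and its host
engine is running; `-r 1` selects the short test suite, and levels 2
through 4 run progressively longer diagnostics.

```python
# Minimal sketch of invoking a DCGM diagnostic from a harness.
import subprocess


def run_dcgm_diag(level: int = 1) -> bool:
    """Run a DCGM diagnostic; True means the run reported no failures."""
    result = subprocess.run(
        ["dcgmi", "diag", "-r", str(level)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # Surface the output so the harness can attach it to its report.
        print(result.stdout or result.stderr)
    return result.returncode == 0
```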

Health checks may run at all levels of the infrastructure because
certain information is only available at a particular level of the
stack. For example, many of the components need direct access to the
GPU driver, which might exist at the Host OS level or be provided as a
passthrough device at the VM level. The net effect is that a given set
of health checks needs to run at the appropriate place, as the sketch
below illustrates.
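
The snippet below is one way to express that placement logic: it probes
for GPU driver visibility and enables GPU-specific checks only where
the driver is reachable. The check names are illustrative, not part of
the reference architecture.

```python
# Sketch of selecting which checks to enable based on whether this OS
# instance (host or VM) can see the GPU driver.
import shutil
from pathlib import Path


def gpu_driver_visible() -> bool:
    """True if the GPU driver is reachable from this level of the stack."""
    return shutil.which("nvidia-smi") is not None or any(
        Path("/dev").glob("nvidia[0-9]*")
    )


def select_checks() -> list[str]:
    checks = ["cpu_load", "memory", "disk"]  # runnable anywhere in-band
    if gpu_driver_visible():
        checks += ["dcgm_diag", "gpu_thermals"]  # require driver access
    return checks
```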

The health check entity typically includes a harness, which runs the
local health checks on a regular heartbeat and interacts with the
overall break-fix solution, plus the checks themselves. Some health
checks may run at every interval, and others may run less often or only
upon request. As health issues are detected, the health check entities
report them to the break-fix control plane so it can act.
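
A minimal harness sketch follows, with per-check heartbeat intervals
and a hypothetical reporting endpoint; the check registry and intervals
are placeholder assumptions.

```python
# Harness sketch: run each registered check on its own heartbeat and
# report failures to the break-fix control plane.
import time

import requests

CONTROL_PLANE = "https://breakfix.example.com"  # hypothetical endpoint

# name -> (check callable returning True when healthy, interval seconds)
CHECKS = {
    "gpu_thermals": (lambda: True, 60),       # placeholder check bodies
    "dcgm_diag_quick": (lambda: True, 3600),  # heavier check, runs hourly
}


def harness_loop(node_id: str) -> None:
    last_run = {name: 0.0 for name in CHECKS}
    while True:
        now = time.time()
        for name, (check, interval) in CHECKS.items():
            if now - last_run[name] < interval:
                continue
            last_run[name] = now
            if not check():
                # Report the unhealthy result so the control plane can
                # triage and, if needed, pull the node from circulation.
                requests.post(
                    f"{CONTROL_PLANE}/nodes/{node_id}/health",
                    json={"check": name, "healthy": False},
                    timeout=10,
                )
        time.sleep(5)
```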

![Break-Fix Architecture](https://files.buildwithfern.com/nvidia-dsx.docs.buildwithfern.com/dsx/a411a4d95dadf0b07f56479037a5ecfbb39f67b31c5565700ae29ebcf8e5ba8b/_dot_dot_/docs/guides/ncp-software-reference-guide/assets/images/ncp-srg-break-fix-arch.png)