# Breakfix Requirements

The NCP must provide a dedicated "Breakfix API" to support fleet reliability. Node-level remediation must not affect other parts of the tenancy; in particular, NVLink must be reconfigured correctly whenever a node is taken out of the tenancy.

The API must enable the following actions:

| Req ID    | Test Details [(Legend)](/dsx/ncp/nvidia-requirements-for-ai-clouds/appendix#test-legend) | Requirement Area   | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| :-------- | :--------------------------------------------------------------------------------------- | :----------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **BFX01** | add | Breakfix Lifecycle | **Compute**: Power-cycle an individual node or reset a VM instance.<br>**GPU**: Reset the GPUs on an individual node (as needed - k8s).<br>**Maintenance**: Return or report an individual node or a rack to the Provider for maintenance.<br>**Cordon**: Mark a node as unschedulable for new workloads while existing workloads run to completion.<br>**Replace**: Request a host replacement when health thresholds are breached. |
| **BFX02** | | Breakfix Events | Query for upcoming or current maintenance events for a node or rack.<br>Query for retirement notices for a node or rack.<br>Query for historical and status information on equipment repair.<br>Event information should include: ticket open date; ticket update date; ticket close date; Hardware Stable Identifier (e.g., node ID); hardware category/type impacted (e.g., GPU, fan, interconnect); maintenance/error/fault description (a short description of the issue); action category (e.g., repairs done on faulty GPUs to resolve the fault); Provider Account ID; ticket ID; Node Handover Date (the date the node was deployed in production). |
| **BFX03** | | Diagnostics | Identify the serial numbers of installed hardware (chassis, baseboard, network adapters, CPU, GPU, etc.). Obfuscated but stable identifiers are also acceptable.<br>Inspect the firmware versions of compute nodes and NV switch trays. |
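
The requirements above do not prescribe a schema, so the following is only a minimal Python sketch of how a client might model the BFX01 actions and the BFX02 event fields. Every name in it (`BreakfixAction`, `BreakfixEvent`, `events_for`, and all field names) is a hypothetical illustration, not part of any NVIDIA or NCP API.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Iterable, List, Optional


class BreakfixAction(Enum):
    """Hypothetical labels for the BFX01 lifecycle actions."""
    POWER_CYCLE = "power_cycle"   # Compute: power-cycle a node
    RESET_VM = "reset_vm"         # Compute: reset a VM instance
    RESET_GPU = "reset_gpu"       # GPU: reset GPUs on a node
    MAINTENANCE = "maintenance"   # return/report a node or rack
    CORDON = "cordon"             # no new workloads, existing ones finish
    REPLACE = "replace"           # request a host replacement


@dataclass(frozen=True)
class BreakfixEvent:
    """One repair/maintenance record carrying the BFX02 fields (names assumed)."""
    ticket_id: str
    provider_account_id: str
    hardware_id: str              # stable identifier, e.g. node ID
    hardware_category: str        # e.g. "GPU", "fan", "interconnect"
    description: str              # short description of the fault
    action: str                   # categorization of the repair action
    ticket_opened: date
    ticket_updated: date
    node_handover: date           # date the node was deployed in production
    ticket_closed: Optional[date] = None  # None while the ticket is open

    def is_open(self) -> bool:
        return self.ticket_closed is None


def events_for(events: Iterable[BreakfixEvent], hardware_id: str,
               open_only: bool = False) -> List[BreakfixEvent]:
    """Return the events for one piece of hardware, most recently updated first."""
    matched = [
        e for e in events
        if e.hardware_id == hardware_id and (e.is_open() or not open_only)
    ]
    return sorted(matched, key=lambda e: e.ticket_updated, reverse=True)
```

With `open_only=True`, the helper answers the "upcoming/current maintenance" query; without it, the same call returns the repair history for that node or rack. A real implementation would fetch these records from the provider's API rather than filter an in-memory list.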