Breakfix Requirements

View as Markdown

Breakfix Requirements

The NCP must provide a specific “Breakfix API” to support fleet reliability. Any node-level remediation must not impact other parts of the tenancy; specifically, NVLink must be re-configured properly to take a node out of the tenancy.

The API must enable the following actions:

Req IDTest Details (Legend)Requirement AreaDescription
BFX01addBreakfix LifecycleCompute: Power-cycle individual nodes or reset a VM instance. GPU: Reset GPUs on an individual node (as needed - k8s) Maintenance: Return/Report an individual node and a rack to the Provider for maintenance Cordon: Mark a node as unschedulable for new workloads (but finish existing) Replace: Request a host-replacement when health thresholds are breached
BFX02Breakfix EventsQuery for any upcoming/current maintenance events for a node or rack Query for any retirement notices for a node/rack. Query for historical / status information for equipment repair. Event information should include: ticket open date ticket update date ticket close date Hardware Stable Identifier (e.g., node ID) Hardware category/type impacted (e.g., GPU, fan, interconnect) Maintenance/Error/fault description (some short description of the issue) Action: Categorization of action (e.g. repairs done on faulty GPUs to resolve the fault) Provider Account ID ticket ID Node Handover Date (Date when the node was deployed in Production)
BFX03DiagnosticsIdentify serial numbers of installed hardware (chassis, baseboard, network adapters, CPU, GPU, etc). Obfuscated but stable identifiers are also OK. Inspect firmware versions of compute nodes and NV switch trays.