
Copyright © 2026, NVIDIA Corporation.

Repair Workflows Overview


NICo supports two tenant-facing repair paths and provider-facing repair workflows. Start here to choose the right workflow.

Choosing a Repair Path

| Situation | Use | Result |
| --- | --- | --- |
| The tenant should keep the assigned instance while repair is attempted. | Online Repair | The assigned instance moves from Ready to Repairing, and then back to Ready when the repair override is cleared. |
| The issue requires disruptive repair, or online repair did not resolve it. | Release Instance for Full Repair | The tenant releases the instance. NICo cleans and quarantines the machine before repair handling. |
| A platform admin or repair tenant needs to claim and repair a released machine. | Repair Tenant Workflow | The repair tenant creates a targeted repair instance, sets repair outcome labels, and releases the machine back to ready or failed handling. |
| Provider automation, repair tenants, or operators need to understand repair signals and completion behavior. | Repair System Integration | Provider-side workflows use health overrides and repair completion signals to return machines safely to the allocation pool. |
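
The two tenant-facing rows of the table can be read as a simple decision rule. The sketch below is illustrative only: the `RepairPath` enum, the function name, and its boolean parameters are hypothetical helpers and not part of NICo. It only encodes the rule that disruptive repair, or an online repair that did not resolve the issue, leads to the full repair path.

```python
from enum import Enum


class RepairPath(Enum):
    # Values mirror the "Use" column of the table; the enum itself is hypothetical.
    ONLINE_REPAIR = "Online Repair"
    FULL_REPAIR = "Release Instance for Full Repair"


def choose_tenant_repair_path(needs_disruptive_repair: bool,
                              online_repair_failed: bool = False) -> RepairPath:
    """Pick a tenant-facing repair path following the table's first two rows."""
    # Disruptive repair, or an online repair that did not resolve the issue,
    # means the tenant releases the instance for full repair.
    if needs_disruptive_repair or online_repair_failed:
        return RepairPath.FULL_REPAIR
    # Otherwise the tenant keeps the instance while repair is attempted.
    return RepairPath.ONLINE_REPAIR
```

For example, `choose_tenant_repair_path(False)` selects online repair, while `choose_tenant_repair_path(False, online_repair_failed=True)` selects the full repair path.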

Workflow Summary

Online repair is the least disruptive path. It is requested through the Machine update API, keeps the tenant assignment in place, and uses a repair health override to keep the instance in Repairing.

Full repair is the disruptive path. It is requested through the Instance delete/release API with a machine health issue. The tenant gives up the instance, and NICo prevents the machine from returning to normal allocation until repair and validation are complete.

An instance cannot be released for full repair while it is still in online repair. If online repair is active and the issue now requires full repair, clear online repair first, wait for the instance to return to Ready, and then release the instance for full repair.
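
The ordering constraint above can be sketched as a small guard that automation might run before escalating. This is a minimal sketch under stated assumptions: the function name, its parameters, and the returned step names are hypothetical, and only the Ready and Repairing states come from this page.

```python
READY = "Ready"          # instance state named on this page
REPAIRING = "Repairing"  # instance state named on this page


def next_escalation_step(instance_status: str, online_repair_active: bool) -> str:
    """Return the next step when escalating from online repair to full repair.

    The step names are hypothetical labels, not NICo API operations.
    """
    if online_repair_active:
        # Release is not allowed while online repair is still active.
        return "clear-online-repair"
    if instance_status != READY:
        # Online repair was cleared, but the instance has not returned to Ready.
        return "wait-for-ready"
    # Only a Ready instance with no active online repair can be released.
    return "release-for-full-repair"
```

Calling the function repeatedly as the instance changes state yields the sequence described above: clear online repair, wait for Ready, then release.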

When clearing online repair because the repair failed, update the instance labels before releasing it for full repair. A label such as onlineRepair.status: Failed keeps the failure visible to tenant and operator tooling while the instance is being escalated.
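
As a rough illustration of the labeling step, the helper below merges the failure label into an existing label set without dropping other labels. The function name is hypothetical, and how the resulting labels are sent in the instance update request is not shown here, because the exact payload is defined in the individual runbooks.

```python
from typing import Dict


def with_failed_online_repair_label(labels: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the labels with the online repair failure recorded."""
    updated = dict(labels)  # copy so the caller's label set is not mutated
    updated["onlineRepair.status"] = "Failed"  # label key and value from the text above
    return updated
```

Existing labels are preserved, so `with_failed_online_repair_label({"team": "ml"})` keeps the `team` label and adds `onlineRepair.status`.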

Repair system integration is the provider-side behavior behind full repair. It explains how tenant-reported-issue, repair-request, repair tenants, and manual provider actions fit together.

The repair tenant workflow is the operator runbook for the middle of the full repair process: targeted instance creation into a dedicated repair tenant, machine repair, repair outcome labeling, and repair tenant release.

API Surface

| Workflow | REST operation | Restish operation |
| --- | --- | --- |
| Online repair | PATCH /v2/org/{org}/nico/machine/{machineId} | update-machine |
| Mark failed online repair | PATCH /v2/org/{org}/nico/instance/{instanceId} | update-instance |
| Release for full repair | DELETE /v2/org/{org}/nico/instance/{instanceId} | delete-instance |
| Repair tenant machine pickup | POST /v2/org/{org}/nico/instance | create-instance |
| Repair status and outcome labeling | PATCH /v2/org/{org}/nico/machine/{machineId} | update-machine |

Use the individual runbooks for exact payloads and validation rules.
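
The operations above can be captured in a small lookup that expands the path templates, which may be useful when scripting against the API. This is a sketch, not a client: the workflow keys, the `Operation` type, and `request_line` are hypothetical names, and request bodies are deliberately left out because the payloads are documented in the runbooks. Only the HTTP methods, paths, and Restish operation names are taken from this page.

```python
from typing import Dict, NamedTuple


class Operation(NamedTuple):
    method: str         # HTTP method from the table
    path_template: str  # REST path from the table, with {placeholders}
    restish: str        # Restish operation name from the table


# Keys are hypothetical short names for the workflows listed on this page.
OPERATIONS: Dict[str, Operation] = {
    "online-repair": Operation(
        "PATCH", "/v2/org/{org}/nico/machine/{machineId}", "update-machine"),
    "mark-failed-online-repair": Operation(
        "PATCH", "/v2/org/{org}/nico/instance/{instanceId}", "update-instance"),
    "release-for-full-repair": Operation(
        "DELETE", "/v2/org/{org}/nico/instance/{instanceId}", "delete-instance"),
    "repair-tenant-pickup": Operation(
        "POST", "/v2/org/{org}/nico/instance", "create-instance"),
    "repair-outcome-labeling": Operation(
        "PATCH", "/v2/org/{org}/nico/machine/{machineId}", "update-machine"),
}


def request_line(workflow: str, **path_params: str) -> str:
    """Expand the path template for a workflow into 'METHOD /path'."""
    op = OPERATIONS[workflow]
    # str.format raises KeyError if a required path parameter is missing.
    return f"{op.method} {op.path_template.format(**path_params)}"
```

With placeholder identifiers, `request_line("release-for-full-repair", org="acme", instanceId="i-123")` returns `DELETE /v2/org/acme/nico/instance/i-123`.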

Reader Guide

Tenant admins should read:

  1. Online Repair
  2. Release Instance for Full Repair

Provider operators and repair automation owners should also read:

  1. Repair Tenant Workflow
  2. Repair System Integration