
Copyright © 2026, NVIDIA Corporation.


Online Repair


Online repair lets a privileged tenant admin report a machine health issue and move the assigned instance from Ready to Repairing without releasing the instance. Use this workflow when the repair can be attempted while the tenant keeps the instance assignment.

If online repair cannot fix the issue, clear online repair first and then release the instance for full repair. See Release Instance for Full Repair.

Audience and Access

This page is intended for tenant admins and platform operators writing tenant-facing runbooks.

The caller must have access to the Infra Controller REST API through an API profile such as carbide-stg. The online repair operation is allowed for provider admins and privileged tenant admins. In tenant workflows, this means the tenant must have the required privileged capability for repair operations, such as targeted instance creation access.

What Online Repair Does

When online repair is enabled:

  1. The API validates that the machine is assigned to an instance and that the instance is currently Ready.
  2. The API records online repair metadata on the instance.
  3. The API sends a health override to the site for the reported issue.
  4. The assigned instance is moved to Repairing.
  5. NICo Core treats the online repair health override as an active repair signal, so tenant-facing state remains Repairing while the repair override is active and the instance is otherwise tenant-ready.

When online repair is disabled:

  1. The API removes the online repair health override.
  2. Online repair metadata is removed from the instance.
  3. The assigned instance is moved back to Ready.
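The two sequences above amount to a two-state transition on the assigned instance. The following Python sketch models them for illustration only. The `Instance` class and its fields (`repair_metadata`, `health_override`) are hypothetical and do not reflect NICo's internal types or storage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    """Hypothetical stand-in for a tenant instance; not a NICo type."""
    state: str = "Ready"
    repair_metadata: Optional[dict] = None
    health_override: Optional[str] = None


def enable_online_repair(instance: Optional[Instance], issue_summary: str) -> Instance:
    # Step 1: the machine must be assigned and the instance must be Ready.
    if instance is None:
        raise ValueError("Machine must be assigned to an Instance")
    if instance.state != "Ready":
        raise ValueError("Instance must be in Ready state to enter online repair")
    # Steps 2-4: record metadata, apply the health override, move to Repairing.
    instance.repair_metadata = {"summary": issue_summary}
    instance.health_override = issue_summary
    instance.state = "Repairing"
    return instance


def disable_online_repair(instance: Instance) -> Instance:
    # Remove the override and the metadata, then return the instance to Ready.
    instance.health_override = None
    instance.repair_metadata = None
    instance.state = "Ready"
    return instance
```

The instance object is never replaced or released in either direction, which is the property that distinguishes online repair from full repair.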

REST and Restish

The REST API operation is:

PATCH /v2/org/{org}/nico/machine/{machineId}

The OpenAPI operation ID is update-machine. Restish exposes operation IDs as commands, so the command shape is:

$ restish <api-profile> update-machine <tenant-org-id> <machine-id> < <request-body-json>

For example, in staging:

$ restish carbide-stg update-machine <tenant-org-id> <machine-id> < online-repair-on.json

carbide-stg is the Restish API profile or environment. Replace it with the profile for your deployment.

Use Restish help to confirm the operation signature in the target environment:

$ restish carbide-stg update-machine --help

Restish prints the HTTP status and JSON error body when a request fails. Use that response body when troubleshooting validation or permission errors.
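For scripted runbooks, the command shape above can be wrapped in a small helper. This is a minimal sketch, assuming `restish` is on the PATH and that the request body is supplied on standard input as in the examples. The helper names `restish_argv` and `run_restish` are hypothetical and not part of Restish or NICo.

```python
import subprocess
from pathlib import Path
from typing import List


def restish_argv(api_profile: str, operation: str, org_id: str, resource_id: str) -> List[str]:
    """Build the argument list for: restish <api-profile> <operation> <org> <id>."""
    return ["restish", api_profile, operation, org_id, resource_id]


def run_restish(argv: List[str], body_file: str) -> subprocess.CompletedProcess:
    """Run restish with the JSON body on stdin, like `< body.json` in a shell."""
    with Path(body_file).open("rb") as body:
        # check=False: return the result instead of raising, so the caller can
        # inspect the captured output when a request fails.
        return subprocess.run(argv, stdin=body, capture_output=True, check=False)
```

For example, `restish_argv("carbide-stg", "update-machine", org_id, machine_id)` yields the same arguments as the staging command above. Passing the arguments as a list avoids shell quoting issues with identifiers.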

Before You Start

Collect the following values:

| Value | Description |
| --- | --- |
| <api-profile> | Restish profile, for example carbide-stg. |
| <tenant-org-id> | Tenant organization identifier used by the REST API. |
| <machine-id> | Machine ID assigned to the tenant instance. This is the fm... machine identifier, not the instance UUID. |

Confirm these preconditions:

  • The machine is assigned to an instance.
  • The assigned instance is in Ready.
  • The machine is present on the site.
  • The issue can be investigated without the tenant releasing the instance.

Enter Online Repair

Create online-repair-on.json:

{
  "onlineRepair": {
    "enabled": true,
    "policy": {
      "allowAutoInstanceDeletionOnFailure": false
    },
    "acknowledgments": {
      "acceptDataCorruptionRisk": true,
      "acceptRepairTeamAccess": true,
      "acceptInstanceDeletionRisk": true
    }
  },
  "healthIssue": {
    "category": "Hardware",
    "summary": "GPU diagnostics show intermittent ECC errors",
    "details": "Tenant observed intermittent GPU ECC errors during workload execution. Please perform online repair while preserving the assigned instance."
  }
}

Run:

$ restish carbide-stg update-machine <tenant-org-id> <machine-id> < online-repair-on.json

Expected result:

  • The API returns the updated machine.
  • The assigned instance moves to Repairing.
  • A site health override is applied for the tenant-reported repair request.
  • The instance remains assigned to the tenant.

Set allowAutoInstanceDeletionOnFailure to false unless the tenant explicitly authorizes the platform to delete the instance if online repair fails.
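If the request body is generated rather than written by hand, a short script can keep the safe default for automatic deletion. The sketch below assumes the tenant has already accepted all three acknowledgments out of band. The function names `build_enter_request` and `write_request` are hypothetical.

```python
import json


def build_enter_request(category: str, summary: str, details: str,
                        allow_auto_delete: bool = False) -> dict:
    """Assemble an enter-online-repair body with the same shape as the example above."""
    return {
        "onlineRepair": {
            "enabled": True,
            # Defaults to False; set True only with explicit tenant authorization.
            "policy": {"allowAutoInstanceDeletionOnFailure": allow_auto_delete},
            # Assumption: the tenant has accepted all three acknowledgments.
            "acknowledgments": {
                "acceptDataCorruptionRisk": True,
                "acceptRepairTeamAccess": True,
                "acceptInstanceDeletionRisk": True,
            },
        },
        "healthIssue": {"category": category, "summary": summary, "details": details},
    }


def write_request(path: str, body: dict) -> None:
    """Write the body as indented JSON, for example to online-repair-on.json."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(body, fh, indent=2)
        fh.write("\n")
```

Making `allow_auto_delete` a keyword with a `False` default means a caller has to opt in deliberately to automatic instance deletion.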

Health Issue Fields

healthIssue is required when entering online repair.

| Field | Requirement |
| --- | --- |
| category | Required. Must be one of Hardware, Network, Performance, Storage, Software, or Other. |
| summary | Required. Maximum 512 characters. |
| details | Required. Maximum 8192 characters. |

Keep summary short and operational, and put reproduction steps or supporting evidence in details. The summary is used in the tenant-facing health message.
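The field rules can be checked locally before sending the request. This is a minimal sketch, not the server's validation logic: it measures length with Python's `len`, which counts code points, and the API may count characters differently. The name `health_issue_errors` is hypothetical.

```python
ALLOWED_CATEGORIES = {"Hardware", "Network", "Performance", "Storage", "Software", "Other"}
MAX_SUMMARY = 512
MAX_DETAILS = 8192


def health_issue_errors(issue: dict) -> list:
    """Return a list of problems; an empty list means the object passes these checks."""
    errors = []
    if issue.get("category") not in ALLOWED_CATEGORIES:
        errors.append("category must be one of " + ", ".join(sorted(ALLOWED_CATEGORIES)))
    for field, limit in (("summary", MAX_SUMMARY), ("details", MAX_DETAILS)):
        value = issue.get(field)
        if not isinstance(value, str) or not value:
            errors.append(f"{field} is required")
        elif len(value) > limit:
            errors.append(f"{field} exceeds {limit} characters")
    return errors
```

Returning every problem at once, rather than stopping at the first, lets a runbook author fix the whole object in one pass.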

Exit Online Repair

After the repair team confirms that the issue is fixed, create online-repair-off.json:

{
  "onlineRepair": {
    "enabled": false
  }
}

Run:

$ restish carbide-stg update-machine <tenant-org-id> <machine-id> < online-repair-off.json

Expected result:

  • The online repair health override is removed.
  • Online repair metadata is removed from the instance.
  • The assigned instance moves back to Ready.

Do not include healthIssue, policy, or acknowledgments when exiting online repair.

If online repair failed and the machine now needs disruptive repair, this exit step is still required before releasing the instance. An instance cannot be released for full repair while it remains in online repair. Clear online repair, confirm the instance is back in Ready, and then follow Release Instance for Full Repair.

Mark Failed Online Repair

If online repair is being cleared because the repair failed, update the instance labels before releasing the instance for full repair. This leaves a visible breadcrumb for tenant and operator tooling after the instance leaves online repair.

The REST API operation is:

PATCH /v2/org/{org}/nico/instance/{instanceId}

The OpenAPI operation ID is update-instance. Restish command shape:

$ restish <api-profile> update-instance <tenant-org-id> <instance-id> < <request-body-json>

First inspect the instance and preserve any existing labels. Instance label updates replace the full label map; labels not included in the update request are removed. Labels are limited to 10 key/value pairs, so use the minimum failure labels if the instance is already near that limit.

$ restish carbide-stg get-instance <tenant-org-id> <instance-id>

Create online-repair-failed-labels.json using the existing labels plus the failure labels:

{
  "labels": {
    "env": "staging",
    "owner": "tenant-platform",
    "onlineRepair.status": "Failed",
    "onlineRepair.escalation": "FullRepair",
    "onlineRepair.failureReason": "GPU ECC errors persisted after online repair"
  }
}

Run:

$ restish carbide-stg update-instance <tenant-org-id> <instance-id> < online-repair-failed-labels.json

Recommended labels:

| Label | Value | Purpose |
| --- | --- | --- |
| onlineRepair.status | Failed | Shows that online repair was attempted and did not resolve the issue. |
| onlineRepair.escalation | FullRepair | Shows that the next step is full repair release. |
| onlineRepair.failureReason | Short reason | Captures the concise failure reason. Keep within the label value length limit. |

After the label update succeeds, release the instance using Release Instance for Full Repair.
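Because a label update replaces the full map, building the request body is a merge of the existing labels with the failure labels, followed by a check against the 10-pair limit. The sketch below assumes the existing labels have already been extracted from the get-instance response as a plain dictionary; the response layout is not specified on this page, and the function name `build_failure_label_update` is hypothetical.

```python
MAX_LABELS = 10  # limit on label key/value pairs stated in this section


def build_failure_label_update(existing: dict, failure_reason: str) -> dict:
    """Build an update-instance body that keeps existing labels and adds failure labels."""
    merged = dict(existing)  # copy so the caller's map is not mutated
    merged.update({
        "onlineRepair.status": "Failed",
        "onlineRepair.escalation": "FullRepair",
        "onlineRepair.failureReason": failure_reason,
    })
    if len(merged) > MAX_LABELS:
        raise ValueError(f"{len(merged)} labels exceeds the limit of {MAX_LABELS}")
    return {"labels": merged}
```

If the merge would exceed the limit, one option is to drop onlineRepair.failureReason or onlineRepair.escalation and keep only onlineRepair.status, in line with the guidance to use the minimum failure labels.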

Verification

Use the deployment’s normal machine and instance inspection commands after each step. With Restish, the common pattern is:

$ restish carbide-stg get-machine <tenant-org-id> <machine-id>

Check that:

  • The machine still has the same assigned instance.
  • The instance state is Repairing after enabling online repair.
  • The instance state is Ready after disabling online repair.
  • The online repair health alert is present while online repair is active and cleared after exit.

Validation Rules

The update request must contain only one kind of machine update. Do not combine onlineRepair with label updates, maintenance mode updates, instance type updates, or clearInstanceType.

Entering online repair requires:

  • onlineRepair.enabled: true
  • onlineRepair.policy.allowAutoInstanceDeletionOnFailure
  • All three acknowledgment fields set to true
  • A valid healthIssue

Exiting online repair requires:

  • onlineRepair.enabled: false
  • No healthIssue
  • No onlineRepair.policy
  • No onlineRepair.acknowledgments
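These rules can be applied as a client-side pre-check before calling the API. The following is a sketch under stated assumptions, not the server's implementation: it checks only that healthIssue is present or absent, not that its contents are valid, and it treats any top-level key other than onlineRepair and healthIssue as a different kind of update. The name `request_errors` is hypothetical.

```python
ACK_FIELDS = ("acceptDataCorruptionRisk", "acceptRepairTeamAccess", "acceptInstanceDeletionRisk")


def request_errors(body: dict) -> list:
    """Return rule violations for an online repair request; empty means none found."""
    repair = body.get("onlineRepair")
    if not isinstance(repair, dict) or not isinstance(repair.get("enabled"), bool):
        return ["onlineRepair.enabled must be true or false"]
    errors = []
    # Assumption: other top-level keys indicate a different kind of machine update.
    if set(body) - {"onlineRepair", "healthIssue"}:
        errors.append("do not combine onlineRepair with other machine updates")
    if repair["enabled"]:
        policy = repair.get("policy") or {}
        if "allowAutoInstanceDeletionOnFailure" not in policy:
            errors.append("onlineRepair.policy.allowAutoInstanceDeletionOnFailure is required")
        acks = repair.get("acknowledgments") or {}
        if not all(acks.get(name) is True for name in ACK_FIELDS):
            errors.append("all three acknowledgment fields must be true")
        if "healthIssue" not in body:
            errors.append("healthIssue is required when onlineRepair.enabled is true")
    else:
        for name, present in (("healthIssue", "healthIssue" in body),
                              ("onlineRepair.policy", "policy" in repair),
                              ("onlineRepair.acknowledgments", "acknowledgments" in repair)):
            if present:
                errors.append(f"{name} must not be set when exiting online repair")
    return errors
```

A pre-check like this only shortens the feedback loop. The API remains the authority, and its error body should be used when the two disagree.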

Troubleshooting

| Error | Meaning | Action |
| --- | --- | --- |
| 403 Forbidden | Caller is not a provider admin or privileged tenant admin for this machine. | Use an account with the required tenant admin privileges and repair capability. |
| Machine must be assigned to an Instance | The machine has no active tenant instance. | Online repair is not applicable. Use the full repair workflow or provider repair process. |
| Instance must be in Ready state to enter online repair | The assigned instance is not eligible for online repair. | Wait for the instance to become Ready, or use full repair if it cannot recover. |
| healthIssue is required when onlineRepair.enabled is true | The enter request did not include issue details. | Add a valid healthIssue object. |
| healthIssue, onlineRepair.policy, and onlineRepair.acknowledgments must not be set when exiting online repair | The exit request included enter-only fields. | Use only { "onlineRepair": { "enabled": false } }. |

If online repair fails or requires disruptive work, clear online repair first and then release the instance for full repair using Release Instance for Full Repair.