Copyright © 2026, NVIDIA Corporation.


Repair Tenant Workflow


This runbook is for platform admins, repair tenant admins, and repair automation owners who pick up a machine after a tenant has released it for full repair. The repair tenant claims the exact machine with targeted instance creation, performs diagnostics and repair, records the repair outcome on the machine, and releases the repair instance so NICo can decide whether the machine returns to the ready pool or stays quarantined for failed repair handling.

For the original tenant release path, see Release Instance for Full Repair. For the lower-level health override behavior, see Repair System Integration.

Audience and Access

The caller needs tenant admin access for the dedicated repair tenant, and the repair tenant must have targeted instance creation enabled. The repair tenant also needs access to the site, VPC, operating system, and network resources used for repair instances.

The repair tenant release path uses isRepairTenant: true. The REST API only accepts this flag from tenants with targeted instance creation capability.

Workflow Summary

The platform repair workflow is:

  1. Identify a released machine that has repair signals such as tenant-reported-issue and repair-request.
  2. Create a repair instance in the dedicated repair tenant by targeting the exact machineId.
  3. Use allowUnhealthyMachine: true when the machine is repair-eligible but blocked from normal allocation by status or health.
  4. Set the machine label repair_status: InProgress so any stale status from an older repair cannot be reused accidentally.
  5. Diagnose and fix the machine.
  6. Update the machine labels with the final repair outcome, especially repair_status.
  7. Release the repair tenant instance with isRepairTenant: true.
  8. NICo clears or reapplies repair health overrides based on the machine label and release payload.

REST and Restish

The main REST operations are:

| Step | REST operation | Restish operation |
| --- | --- | --- |
| Create targeted repair instance | POST /v2/org/{org}/nico/instance | create-instance |
| Inspect machine | GET /v2/org/{org}/nico/machine/{machineId} | get-machine |
| Update machine labels | PATCH /v2/org/{org}/nico/machine/{machineId} | update-machine |
| Release repair instance | DELETE /v2/org/{org}/nico/instance/{instanceId} | delete-instance |
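
For automation that calls the REST API directly instead of Restish, the table above can be captured as a small lookup. This is a minimal sketch: the path templates are copied from the table, while the `OPERATIONS` dictionary and the `build_request` helper are hypothetical names used only for illustration.

```python
from urllib.parse import quote

# Method and path template for each operation, copied from the table above.
OPERATIONS = {
    "create-instance": ("POST", "/v2/org/{org}/nico/instance"),
    "get-machine": ("GET", "/v2/org/{org}/nico/machine/{machineId}"),
    "update-machine": ("PATCH", "/v2/org/{org}/nico/machine/{machineId}"),
    "delete-instance": ("DELETE", "/v2/org/{org}/nico/instance/{instanceId}"),
}


def build_request(operation, **path_args):
    """Return (method, path) for an operation (hypothetical helper).

    Path arguments are percent-encoded so an unexpected character in an
    identifier cannot change the shape of the URL. A missing path
    argument raises KeyError.
    """
    method, template = OPERATIONS[operation]
    encoded = {key: quote(str(value), safe="") for key, value in path_args.items()}
    return method, template.format(**encoded)


if __name__ == "__main__":
    print(build_request("update-machine", org="repair-org", machineId="machine-1"))
```

The base URL, authentication, and HTTP client are deployment-specific and are deliberately left out of the sketch.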

Restish exposes OpenAPI operation IDs as commands. The commands use the path arguments from the REST operation and accept JSON bodies through shell redirection:

$ restish <api-profile> create-instance <repair-tenant-org-id> < repair-instance.json
$ restish <api-profile> update-machine <repair-tenant-org-id> <machine-id> < repair-status.json
$ restish <api-profile> delete-instance <repair-tenant-org-id> <repair-instance-id> < repair-release.json

For example, in staging:

$ restish carbide-stg create-instance <repair-tenant-org-id> < repair-instance.json

carbide-stg is the Restish API profile or environment. Replace it with the profile for your deployment.

Use Restish help to confirm operation signatures in the target environment:

$ restish carbide-stg create-instance --help
$ restish carbide-stg update-machine --help
$ restish carbide-stg delete-instance --help

Restish prints the HTTP status and JSON error body when a request fails. Use that response when troubleshooting validation, permission, or workflow errors.

Before You Start

Collect the following values:

| Value | Description |
| --- | --- |
| <api-profile> | Restish profile, for example carbide-stg. |
| <repair-tenant-org-id> | Organization identifier for the dedicated repair tenant. |
| <repair-tenant-id> | Tenant UUID used in the create instance request body. |
| <machine-id> | Machine ID being repaired. |
| <repair-vpc-id> | VPC UUID used by repair instances. |
| <repair-vpc-prefix-id> | VPC prefix UUID for the repair interface. |
| <repair-os-id> | Operating system UUID or an approved iPXE script for repair work. |
| <repair-instance-id> | Repair instance UUID returned by targeted instance creation. |

Confirm these preconditions:

  • The original tenant has released the instance for full repair.
  • Online repair is not active for the original tenant instance.
  • The machine is not assigned to another instance.
  • The machine has both the tenant-reported-issue and repair-request overrides, if the repair tenant release is expected to route the machine automatically either back to the ready pool or to repair-failed handling.
  • The repair tenant has targeted instance creation capability.
  • The repair tenant has site access and can see the target machine.
  • The machine controller state is still provisionable. allowUnhealthyMachine: true can target machines that are not normal-allocation ready, but it does not bypass missing machines, already-assigned machines, or controller states that cannot provision an instance.

When auto-repair is disabled, the original tenant release applies tenant-reported-issue but not repair-request. In that case, provider operations must either manually add repair-request before the repair tenant workflow or manually clear the resolved tenant-reported-issue after a successful repair. Without repair-request, a successful repair tenant release with no new issue does not clear tenant-reported-issue.

Claim the Machine

Create repair-instance.json:

{
  "name": "repair-<machine-id>",
  "description": "Repair instance for machine <machine-id>",
  "tenantId": "<repair-tenant-id>",
  "machineId": "<machine-id>",
  "vpcId": "<repair-vpc-id>",
  "operatingSystemId": "<repair-os-id>",
  "allowUnhealthyMachine": true,
  "interfaces": [
    {
      "vpcPrefixId": "<repair-vpc-prefix-id>",
      "isPhysical": true
    }
  ],
  "labels": {
    "repair.workflow": "MachineRepair",
    "repair.machineId": "<machine-id>",
    "repair.ticket": "INC-12345"
  }
}

Run:

$ restish carbide-stg create-instance <repair-tenant-org-id> < repair-instance.json

Expected result:

  • The API returns the repair instance.
  • The repair instance is assigned to the requested machine.
  • The machine remains blocked from normal tenant allocation while repair health overrides are active.

Use the repair instance only for diagnostics, firmware work, component validation, and other repair activity. Do not use the repair tenant as a normal workload tenant.
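
Repair automation may prefer to generate the repair-instance.json body rather than edit it by hand. The sketch below builds the same structure as the example in this section. The function name `build_repair_instance` and its parameters are hypothetical, and it covers only the case where the repair operating system is referenced by UUID, not the iPXE script alternative.

```python
import json


def build_repair_instance(tenant_id, machine_id, vpc_id, os_id, vpc_prefix_id, ticket=None):
    """Build a targeted repair instance request body (hypothetical helper).

    Field names mirror the repair-instance.json example in this runbook.
    """
    labels = {
        "repair.workflow": "MachineRepair",
        "repair.machineId": machine_id,
    }
    if ticket:
        labels["repair.ticket"] = ticket
    return {
        "name": f"repair-{machine_id}",
        "description": f"Repair instance for machine {machine_id}",
        "tenantId": tenant_id,
        "machineId": machine_id,
        "vpcId": vpc_id,
        "operatingSystemId": os_id,
        # Needed when the machine is repair-eligible but blocked from
        # normal allocation by status or health.
        "allowUnhealthyMachine": True,
        "interfaces": [{"vpcPrefixId": vpc_prefix_id, "isPhysical": True}],
        "labels": labels,
    }


if __name__ == "__main__":
    body = build_repair_instance(
        tenant_id="<repair-tenant-id>",
        machine_id="<machine-id>",
        vpc_id="<repair-vpc-id>",
        os_id="<repair-os-id>",
        vpc_prefix_id="<repair-vpc-prefix-id>",
        ticket="INC-12345",
    )
    print(json.dumps(body, indent=2))
```

Redirecting the printed output to repair-instance.json produces a file that can be passed to create-instance as shown above.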

Mark Repair In Progress

After the repair instance is created, mark the machine as actively being repaired. This avoids a stale repair_status: Completed label from an older repair being interpreted as the outcome for the current repair attempt.

First inspect the machine and preserve any labels that should remain:

$ restish carbide-stg get-machine <repair-tenant-org-id> <machine-id>

Create repair-status-in-progress-labels.json using the existing labels plus repair_status: InProgress:

{
  "labels": {
    "RackIdentifier": "GVX11F01C02",
    "repair_status": "InProgress",
    "repair.ticket": "INC-12345",
    "repair.summary": "Repair tenant diagnostics started"
  }
}

Run:

$ restish carbide-stg update-machine <repair-tenant-org-id> <machine-id> < repair-status-in-progress-labels.json

Perform and Validate Repair

Run the repair procedure required by the issue. Use the repair ticket or tenant-reported issue details to preserve the failure context. Before release, validate the machine enough to decide whether it is safe to return to tenant allocation.

The final release decision is controlled by the machine label repair_status, not by the repair instance label. Set this label on the machine before releasing the repair instance.

The supported repair_status values are matched case-insensitively:

| Machine label | Meaning | NICo release result |
| --- | --- | --- |
| repair_status: Completed | Repair succeeded and validation passed. | If the release has no new issue, NICo removes repair-request and tenant-reported-issue, allowing the machine to return to the ready pool. |
| repair_status: Failed | Repair did not resolve the issue. | NICo removes repair-request and applies or keeps tenant-reported-issue, keeping the machine in repair or failed handling. |
| repair_status: InProgress | Repair work is not complete. | NICo treats the release as incomplete and keeps the machine out of the ready pool. |
| Missing or unknown value | No trusted completion signal. | NICo creates a fallback incomplete-repair issue and keeps the machine out of the ready pool. |
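
Repair tooling can apply the same classification before release to catch a typo in the label early. The following is a minimal sketch of case-insensitive matching as described in the table. The function name and the lowercase return values are hypothetical, and whether NICo tolerates surrounding whitespace is not stated here, so the sketch does not trim it.

```python
KNOWN_STATUSES = {"completed", "failed", "inprogress"}


def normalize_repair_status(labels):
    """Classify the repair_status machine label (hypothetical helper).

    Returns "completed", "failed", "inprogress", or "unknown". A missing
    label, a non-string value, or an unrecognized value maps to "unknown",
    matching the "Missing or unknown value" row of the table.
    """
    raw = labels.get("repair_status")
    if not isinstance(raw, str):
        return "unknown"
    value = raw.lower()
    return value if value in KNOWN_STATUSES else "unknown"


if __name__ == "__main__":
    print(normalize_repair_status({"repair_status": "COMPLETED"}))
    print(normalize_repair_status({"RackIdentifier": "GVX11F01C02"}))
```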

Set Repair Outcome Labels

Before releasing the repair instance, inspect the machine again and preserve any labels that should remain:

$ restish carbide-stg get-machine <repair-tenant-org-id> <machine-id>

Machine label updates replace the full label map. Labels not included in the update request are removed. Labels are limited to 10 key/value pairs, so keep repair labels short and preserve required placement labels such as rack, site, or pool hints.
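
Because the update replaces the whole label map, a safe pattern is to merge the repair labels into the labels read from the machine and to check the size limit before sending the request. The sketch below assumes the current labels have already been extracted from the get-machine response as a plain dictionary. The helper name `merge_labels` is hypothetical.

```python
MAX_LABELS = 10  # limit stated in this runbook


def merge_labels(existing, updates, max_labels=MAX_LABELS):
    """Return an update-machine body that keeps existing labels (hypothetical helper).

    Values in `updates` override values in `existing`. The inputs are
    not modified. Raises ValueError when the merged map exceeds the limit,
    so the caller can drop or shorten labels before calling the API.
    """
    merged = dict(existing)
    merged.update(updates)
    if len(merged) > max_labels:
        raise ValueError(
            f"{len(merged)} labels exceed the limit of {max_labels}; "
            "remove labels that are no longer needed"
        )
    return {"labels": merged}


if __name__ == "__main__":
    current = {"RackIdentifier": "GVX11F01C02", "repair_status": "InProgress"}
    outcome = {
        "repair_status": "Completed",
        "repair.ticket": "INC-12345",
        "repair.summary": "GPU riser replaced and validation passed",
    }
    print(merge_labels(current, outcome))
```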

When repair succeeds, create repair-status-completed-labels.json:

{
  "labels": {
    "RackIdentifier": "GVX11F01C02",
    "repair_status": "Completed",
    "repair.ticket": "INC-12345",
    "repair.summary": "GPU riser replaced and validation passed"
  }
}

Run:

$ restish carbide-stg update-machine <repair-tenant-org-id> <machine-id> < repair-status-completed-labels.json

When repair fails or the machine must not return to the ready pool, create repair-status-failed-labels.json:

{
  "labels": {
    "RackIdentifier": "GVX11F01C02",
    "repair_status": "Failed",
    "repair.ticket": "INC-12345",
    "repair.summary": "GPU ECC errors persist after riser replacement"
  }
}

Run:

$ restish carbide-stg update-machine <repair-tenant-org-id> <machine-id> < repair-status-failed-labels.json

Use repair_status: Completed only after the repair team has validated that the machine can safely re-enter normal allocation. Use repair_status: Failed when the machine should move to repair-failed or manual intervention handling.

Release a Successfully Repaired Machine

After setting repair_status: Completed, create repair-release-completed.json:

{
  "isRepairTenant": true
}

Run:

$ restish carbide-stg delete-instance <repair-tenant-org-id> <repair-instance-id> < repair-release-completed.json

Expected result:

  • The API returns 202 Accepted.
  • NICo releases the repair instance.
  • NICo removes repair-request.
  • NICo removes tenant-reported-issue because no new issue was reported.
  • The machine becomes eligible for the normal ready pool after cleanup and validation.

Release a Machine That Still Needs Repair

If repair failed, validation failed, or the machine should not return to normal allocation, set repair_status: Failed and include a machine health issue in the repair release.

Create repair-release-failed.json:

{
  "isRepairTenant": true,
  "machineHealthIssue": {
    "category": "Hardware",
    "summary": "Repair failed: GPU ECC errors persist",
    "details": "Repair tenant replaced the GPU riser, but validation still reports ECC errors. Keep the machine out of the ready pool for provider intervention."
  }
}

Run:

$ restish carbide-stg delete-instance <repair-tenant-org-id> <repair-instance-id> < repair-release-failed.json

Expected result:

  • The API returns 202 Accepted.
  • NICo releases the repair instance.
  • NICo removes repair-request so automated repair does not loop on the same machine.
  • NICo applies or keeps tenant-reported-issue.
  • The machine stays out of the normal ready pool and is routed to repair-failed or manual intervention handling.

If the repair tenant releases the instance with repair_status: Failed, InProgress, missing, or unknown and does not provide machineHealthIssue, NICo creates a fallback issue with summary RepairSystem processing incomplete.
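
The two release payloads differ only in the optional machineHealthIssue object, so automation can build both from one helper and warn when a release would fall back to the generic issue. This is a sketch under assumptions: the helper names are hypothetical, and the issue is passed through as given because this runbook does not state which machineHealthIssue fields are mandatory.

```python
import json

# Summary of the fallback issue, as quoted in this section.
FALLBACK_SUMMARY = "RepairSystem processing incomplete"


def build_release_body(issue=None):
    """Build the delete-instance body for a repair tenant release (hypothetical helper)."""
    body = {"isRepairTenant": True}
    if issue is not None:
        body["machineHealthIssue"] = dict(issue)
    return body


def fallback_expected(repair_status, issue=None):
    """Return True when the release would get the fallback issue.

    Per this section, that is the case when repair_status is anything
    other than Completed (compared case-insensitively) and the release
    carries no machineHealthIssue.
    """
    completed = isinstance(repair_status, str) and repair_status.lower() == "completed"
    return not completed and issue is None


if __name__ == "__main__":
    issue = {
        "category": "Hardware",
        "summary": "Repair failed: GPU ECC errors persist",
        "details": "Validation still reports ECC errors.",
    }
    print(json.dumps(build_release_body(issue), indent=2))
    if fallback_expected("Failed"):
        print(f"Without an issue, NICo would record: {FALLBACK_SUMMARY}")
```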

If a repair tenant releases a machine that no longer has a repair-request override, NICo does not create a new automated repair loop. A release with a new machineHealthIssue applies tenant-reported-issue; a release with no issue takes no health override action and does not clear an existing tenant-reported-issue.

Completion Matrix

For machines that still have a repair-request override, NICo uses the following release behavior:

| Machine label before release | Release issue provided | Result |
| --- | --- | --- |
| repair_status: Completed | No | Machine can return to the ready pool. |
| repair_status: Completed | Yes | Machine remains blocked by the new tenant-reported-issue. |
| repair_status: Failed | Optional | Machine remains blocked by tenant-reported-issue. |
| repair_status: InProgress | Optional | Machine remains blocked as incomplete repair. |
| Missing or unknown repair_status | Optional | Machine remains blocked as incomplete repair. |
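
The matrix can be expressed as a small decision function, which is useful for predicting the outcome before the release is sent. This is an illustrative sketch, not NICo's implementation: the function name and the returned strings are hypothetical, and the branch for machines without repair-request follows the behavior described in the previous section.

```python
def release_result(repair_status, issue_provided, has_repair_request=True):
    """Predict the outcome of a repair tenant release (hypothetical helper).

    repair_status is the machine label value (or None when missing) and
    is compared case-insensitively.
    """
    status = repair_status.lower() if isinstance(repair_status, str) else ""

    if not has_repair_request:
        # Without repair-request, a new issue applies tenant-reported-issue
        # and a release with no issue takes no health override action.
        return "blocked-new-issue" if issue_provided else "no-override-change"

    if status == "completed":
        return "blocked-new-issue" if issue_provided else "ready-pool"
    if status == "failed":
        return "blocked-tenant-reported-issue"
    # InProgress, missing, or unknown values are treated as incomplete.
    return "blocked-incomplete"


if __name__ == "__main__":
    for status, issue in [("Completed", False), ("Completed", True), ("Failed", False), (None, False)]:
        print(status, issue, "->", release_result(status, issue))
```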

Verification

After releasing the repair instance, inspect the machine and health overrides:

$ restish carbide-stg get-machine <repair-tenant-org-id> <machine-id>

Check that:

  • The repair instance is terminating or deleted.
  • repair-request is removed after the repair tenant release.
  • tenant-reported-issue is removed only for successful repair completion with no new issue.
  • A failed, incomplete, or unknown repair still keeps the machine unavailable for normal tenant allocation.

Provider tooling can also inspect the lower-level health overrides described in Repair System Integration.
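
The override checks in the list above can be automated once the active override names are known. The sketch below assumes that the caller has already collected those names into a set. How they are read from the get-machine response or from provider tooling is not shown in this runbook and is left out. It also assumes the machine carried both repair-request and tenant-reported-issue before the release. The function name is hypothetical.

```python
def check_release(overrides, repair_status, issue_provided):
    """Return a list of problems found after a repair tenant release.

    overrides: set of health override names active on the machine after
    the release (an assumed input format).
    """
    problems = []
    if "repair-request" in overrides:
        problems.append("repair-request is still present")

    completed = isinstance(repair_status, str) and repair_status.lower() == "completed"
    success = completed and not issue_provided
    has_issue = "tenant-reported-issue" in overrides

    if success and has_issue:
        problems.append("tenant-reported-issue was not cleared after a successful repair")
    if not success and not has_issue:
        problems.append("tenant-reported-issue is missing although the repair did not complete cleanly")
    return problems


if __name__ == "__main__":
    print(check_release(set(), "Completed", issue_provided=False))
    print(check_release({"repair-request", "tenant-reported-issue"}, "Failed", issue_provided=True))
```

An empty list means the observed overrides match the expected outcome. The state of the repair instance itself still needs to be checked separately.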

Troubleshooting

| Error | Meaning | Action |
| --- | --- | --- |
| Tenant does not have capability to create Instances using specific Machine ID | The repair tenant is missing targeted instance creation. | Enable targeted instance creation for the repair tenant or use the correct repair tenant. |
| Machine is not in Ready state, but it can be provisioned by setting allowUnhealthyMachine to true | The machine can be targeted for repair but the create request omitted allowUnhealthyMachine. | Add allowUnhealthyMachine: true and retry. |
| Machine is assigned to an Instance | The machine is still assigned elsewhere. | Confirm the original tenant release completed before claiming the machine. |
| Tenant does not have capability to set IsRepairTenant | The release caller is not a targeted-instance-capable tenant. | Release from the repair tenant or use a properly privileged tenant. |
| Successful repair release does not return the machine to ready | The machine did not have repair-request when the repair tenant released it. | Manually clear the resolved tenant-reported-issue. For future attempts, add repair-request before the repair tenant release if automatic routing is expected. |
| Machine returns to failed handling after a successful repair | repair_status was missing, not Completed, or the release included a new machineHealthIssue. | Inspect machine labels, fix repair_status, and verify the release payload. |
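
For repair automation, the API error rows of the table can be turned into a lookup that suggests the next action from the error text Restish prints. This is a rough sketch: it matches on message fragments taken from the table, the exact wording and structure of the JSON error body may differ by release, and the names used here are hypothetical.

```python
# (message fragment, suggested action) pairs taken from the table above.
TROUBLESHOOTING = [
    (
        "does not have capability to create Instances using specific Machine ID",
        "Enable targeted instance creation for the repair tenant or use the correct repair tenant.",
    ),
    (
        "can be provisioned by setting allowUnhealthyMachine to true",
        "Add allowUnhealthyMachine: true and retry.",
    ),
    (
        "Machine is assigned to an Instance",
        "Confirm the original tenant release completed before claiming the machine.",
    ),
    (
        "does not have capability to set IsRepairTenant",
        "Release from the repair tenant or use a properly privileged tenant.",
    ),
]


def suggest_action(error_text):
    """Return the suggested action for an API error, or None if unrecognized."""
    lowered = error_text.lower()
    for fragment, action in TROUBLESHOOTING:
        if fragment.lower() in lowered:
            return action
    return None


if __name__ == "__main__":
    print(suggest_action("Machine is assigned to an Instance"))
```

The last two rows of the table describe workflow outcomes rather than API errors, so they are not part of the lookup.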

Do not clear repair health overrides manually unless the repair outcome is known. The repair tenant release path exists so NICo can make the ready-pool decision from the recorded repair outcome.