This runbook is for platform admins, repair tenant admins, and repair automation owners who pick up a machine after a tenant has released it for full repair. The repair tenant claims the exact machine with targeted instance creation, performs diagnostics and repair, records the repair outcome on the machine, and releases the repair instance so NICo can decide whether the machine returns to the ready pool or stays quarantined for failed repair handling.
For the original tenant release path, see Release Instance for Full Repair. For the lower-level health override behavior, see Repair System Integration.
The caller needs tenant admin access for the dedicated repair tenant, and that tenant must have targeted instance creation enabled. The repair tenant also needs access to the site, VPC, operating system, and network resources used for repair instances.
The repair tenant release path uses isRepairTenant: true. The REST API only accepts this flag from tenants with targeted instance creation capability.
The platform repair workflow is:
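1. The original tenant releases the instance for full repair, which leaves the machine with the tenant-reported-issue and repair-request overrides.
2. The repair tenant creates a targeted instance on the exact machine by machineId.
3. The repair tenant sets allowUnhealthyMachine: true when the machine is repair-eligible but blocked from normal allocation by status or health.
4. The repair tenant sets the machine label repair_status: InProgress so any stale status from an older repair cannot be reused accidentally.
5. The repair tenant performs the repair and records the outcome in the machine label repair_status.
6. The repair tenant releases the repair instance with isRepairTenant: true.

The main REST operations are: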
Restish exposes OpenAPI operation IDs as commands. The commands use the path arguments from the REST operation and accept JSON bodies through shell redirection:
For example, in staging:
carbide-stg is the Restish API profile or environment. Replace it with the profile for your deployment.
Use Restish help to confirm operation signatures in the target environment:
Restish prints the HTTP status and JSON error body when a request fails. Use that response when troubleshooting validation, permission, or workflow errors.
Collect the following values:
Confirm these preconditions:
- The machine has the tenant-reported-issue and repair-request overrides if the repair tenant release is expected to route the machine automatically back to the ready pool or to repair-failed handling.
- A targeted instance with allowUnhealthyMachine: true can target machines that are not normal-allocation ready, but it does not bypass missing machines, already-assigned machines, or controller states that cannot provision an instance.

When auto-repair is disabled, the original tenant release applies tenant-reported-issue but not repair-request. In that case, provider operations must either manually add repair-request before the repair tenant workflow or manually clear the resolved tenant-reported-issue after a successful repair. Without repair-request, a successful repair tenant release with no new issue does not clear tenant-reported-issue.
Create repair-instance.json:
Run:
Expected result:
Use the repair instance only for diagnostics, firmware work, component validation, and other repair activity. Do not use the repair tenant as a normal workload tenant.
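As a rough illustration of the targeted create request, the sketch below assembles a body that carries the two fields this runbook names, machineId and allowUnhealthyMachine. The function name, the extra_fields parameter, and the sample machine ID are hypothetical. The field names for site, VPC, operating system, and network resources are not given in this runbook, so they are passed through unchanged rather than guessed.

```python
import json


def build_repair_instance_request(machine_id, allow_unhealthy=True, extra_fields=None):
    """Assemble a targeted instance request body for the repair tenant.

    extra_fields is a hypothetical pass-through for the site, VPC, operating
    system, and network settings; their real field names are deployment-specific.
    """
    body = dict(extra_fields or {})                  # copy so the caller's dict is untouched
    body["machineId"] = machine_id                   # claim the exact machine
    body["allowUnhealthyMachine"] = allow_unhealthy  # needed when status or health blocks allocation
    return body


if __name__ == "__main__":
    # "machine-0001" is a placeholder, not a real identifier.
    print(json.dumps(build_repair_instance_request("machine-0001"), indent=2))
```

Redirecting the printed JSON into repair-instance.json gives a starting point to complete with the fields required by the target environment.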
After the repair instance is created, mark the machine as actively being repaired. This avoids a stale repair_status: Completed label from an older repair being interpreted as the outcome for the current repair attempt.
First inspect the machine and preserve any labels that should remain:
Create repair-status-in-progress-labels.json using the existing labels plus repair_status: InProgress:
Run:
Run the repair procedure required by the issue. Use the repair ticket or tenant-reported issue details to preserve the failure context. Before release, validate the machine enough to decide whether it is safe to return to tenant allocation.
The final release decision is controlled by the machine label repair_status, not by the repair instance label. Set this label on the machine before releasing the repair instance.
Supported values are case-insensitive:
Before releasing the repair instance, inspect the machine again and preserve any labels that should remain:
Machine label updates replace the full label map. Labels not included in the update request are removed. Labels are limited to 10 key/value pairs, so keep repair labels short and preserve required placement labels such as rack, site, or pool hints.
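Because an update replaces the whole label map, it can help to build the new map from the labels read back from the machine instead of writing it by hand. The following is a minimal sketch under that assumption; the helper name and the sample labels are hypothetical, and the exact request body expected by the label update operation is not shown here.

```python
import json

MAX_LABELS = 10  # label limit stated in this runbook


def with_repair_status(existing_labels, status):
    """Return a full replacement label map that keeps the existing labels."""
    labels = dict(existing_labels)    # copy; the update replaces the whole map
    labels["repair_status"] = status  # overwrite any stale value
    if len(labels) > MAX_LABELS:
        raise ValueError(
            f"{len(labels)} labels exceed the limit of {MAX_LABELS}; "
            "drop a non-essential label before updating"
        )
    return labels


if __name__ == "__main__":
    # Hypothetical labels read back from the machine before the update.
    current = {"rack": "r12", "pool": "gpu-a", "repair_status": "Completed"}
    print(json.dumps(with_repair_status(current, "InProgress"), indent=2))
```

Raising on an oversized map, rather than silently dropping a label, keeps the decision about which label to remove with the operator.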
When repair succeeds, create repair-status-completed-labels.json:
Run:
When repair fails or the machine must not return to the ready pool, create repair-status-failed-labels.json:
Run:
Use repair_status: Completed only after the repair team has validated that the machine can safely re-enter normal allocation. Use repair_status: Failed when the machine should move to repair-failed or manual intervention handling.
After setting repair_status: Completed, create repair-release-completed.json:
Run:
Expected result:
- The API returns 202 Accepted.
- NICo removes repair-request.
- NICo removes tenant-reported-issue because no new issue was reported.

If repair failed, validation failed, or the machine should not return to normal allocation, set repair_status: Failed and include a machine health issue in the repair release.
Create repair-release-failed.json:
Run:
Expected result:
- The API returns 202 Accepted.
- NICo removes repair-request so automated repair does not loop on the same machine.
- NICo applies tenant-reported-issue for the newly reported issue.

If the repair tenant releases the instance with repair_status set to Failed, InProgress, missing, or unknown and does not provide machineHealthIssue, NICo creates a fallback issue with the summary RepairSystem processing incomplete.
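The two release bodies differ only in whether a machine health issue is attached. A small sketch of that difference follows; isRepairTenant and machineHealthIssue are the field names used in this runbook, while the function name and the summary key inside the issue are assumptions made for illustration.

```python
import json


def build_repair_release(machine_health_issue=None):
    """Build a repair tenant release body, with or without a new issue."""
    body = {"isRepairTenant": True}  # marks the release as a repair tenant release
    if machine_health_issue is not None:
        body["machineHealthIssue"] = machine_health_issue
    return body


if __name__ == "__main__":
    # Successful repair: no new issue is reported.
    print(json.dumps(build_repair_release(), indent=2))
    # Failed repair: the issue shape is hypothetical; use the schema for your deployment.
    print(json.dumps(build_repair_release({"summary": "Memory test still failing"}), indent=2))
```

Attaching an explicit issue on a failed repair keeps the failure context on the machine instead of relying on the generic fallback summary.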
If a repair tenant releases a machine that no longer has a repair-request override, NICo does not create a new automated repair loop. A release with a new machineHealthIssue applies tenant-reported-issue; a release with no issue takes no health override action and does not clear an existing tenant-reported-issue.
For machines that still have a repair-request override, NICo uses the following release behavior:
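The behavior described in this runbook for a machine that still has repair-request can be summarized as a small decision function. This is an illustrative model of the documented outcomes, not NICo's implementation; the function name and the keys of the returned dictionary are hypothetical.

```python
FALLBACK_SUMMARY = "RepairSystem processing incomplete"  # summary quoted in this runbook


def release_outcome(repair_status, has_machine_health_issue):
    """Model the override handling for a machine that still has repair-request."""
    status = (repair_status or "").strip().lower()  # label values are case-insensitive
    outcome = {
        "repair_request_removed": True,          # removed on every repair tenant release
        "tenant_reported_issue_removed": False,
        "issue": None,
    }
    if has_machine_health_issue:
        outcome["issue"] = "provided"            # the new issue keeps the machine out of the pool
    elif status == "completed":
        outcome["tenant_reported_issue_removed"] = True  # successful repair, no new issue
    else:
        outcome["issue"] = FALLBACK_SUMMARY      # Failed, InProgress, missing, or unknown
    return outcome


if __name__ == "__main__":
    for status, has_issue in [("Completed", False), ("failed", True), (None, False)]:
        print(status, has_issue, release_outcome(status, has_issue))
```

Only the combination of a Completed status and no new issue clears tenant-reported-issue in this model, which matches the verification checks later in the runbook.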
After releasing the repair instance, inspect the machine and health overrides:
Check that:
- repair-request is removed after the repair tenant release.
- tenant-reported-issue is removed only for successful repair completion with no new issue.

Provider tooling can also inspect the lower-level health overrides described in Repair System Integration.
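These checks can be scripted against the override names returned when inspecting the machine. The sketch below assumes those names have already been collected into a list; the function name, its parameters, and the wording of the messages are hypothetical.

```python
def check_overrides_after_release(override_names, repair_succeeded, new_issue_reported):
    """Return a list of problems found in the post-release health overrides."""
    names = set(override_names)
    problems = []
    if "repair-request" in names:
        problems.append("repair-request is still present after the repair tenant release")
    # tenant-reported-issue is cleared only for a successful repair with no new issue.
    expect_cleared = repair_succeeded and not new_issue_reported
    has_issue = "tenant-reported-issue" in names
    if expect_cleared and has_issue:
        problems.append("tenant-reported-issue was not cleared after a successful repair")
    if not expect_cleared and not has_issue:
        problems.append("tenant-reported-issue is missing although the repair did not complete cleanly")
    return problems


if __name__ == "__main__":
    # Hypothetical override list read after a failed repair release.
    print(check_overrides_after_release(["tenant-reported-issue"], False, True))
```

An empty list means the overrides match the expected state for the recorded outcome.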
Do not clear repair health overrides manually unless the repair outcome is known. The repair tenant release path exists so NICo can make the ready-pool decision from the recorded repair outcome.