Online repair lets a privileged tenant admin report a machine health issue and move the assigned instance from Ready to Repairing without releasing the instance. Use this workflow when the repair can be attempted while the tenant keeps the instance assignment.
If online repair cannot fix the issue, clear online repair first and then release the instance for full repair. See Release Instance for Full Repair.
This page is intended for tenant admins and platform operators writing tenant-facing runbooks.
The caller must have access to the Infra Controller REST API through an API profile such as carbide-stg. The online repair operation is allowed for provider admins and privileged tenant admins. In tenant workflows, this means the tenant must have the required privileged capability for repair operations, such as targeted instance creation access.
When online repair is enabled:
Ready.
Repairing.
Repairing while the repair override is active and the instance is otherwise tenant-ready.
When online repair is disabled:
Ready.
The REST API operation is:
The OpenAPI operation ID is update-machine. Restish exposes operation IDs as commands, so the command shape is:
For example, in staging:
carbide-stg is the Restish API profile or environment. Replace it with the profile for your deployment.
Use Restish help to confirm the operation signature in the target environment:
Restish prints the HTTP status and JSON error body when a request fails. Use that response body when troubleshooting validation or permission errors.
Collect the following values:
Confirm these preconditions:
Ready.
Create online-repair-on.json:
Run:
Expected result:
Repairing.
Set allowAutoInstanceDeletionOnFailure to false unless the tenant explicitly authorizes the platform to delete the instance if online repair fails.
healthIssue is required when entering online repair.
Provide a short operational summary and a detailed reproduction or supporting evidence. The summary is used in the tenant-facing health message.
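The entry request described above can be sketched in Python as a small builder for online-repair-on.json. This is a minimal sketch, not the API's documented schema: the `onlineRepair.enabled`, `onlineRepair.policy.allowAutoInstanceDeletionOnFailure`, and `healthIssue` names come from this page, but the `summary` and `details` keys inside `healthIssue` and the helper name `build_online_repair_on` are assumptions made for illustration. Confirm the real field names against the OpenAPI schema for update-machine.

```python
import json


def build_online_repair_on(summary: str, details: str,
                           allow_auto_delete: bool = False) -> dict:
    """Build a request body for entering online repair (illustrative helper)."""
    if not summary.strip() or not details.strip():
        raise ValueError("healthIssue needs both a summary and details")
    return {
        "onlineRepair": {
            "enabled": True,
            # Defaults to False: deletion on failure needs explicit tenant approval.
            # If it is set to True, the API may also expect
            # onlineRepair.acknowledgments; that shape is not covered here.
            "policy": {"allowAutoInstanceDeletionOnFailure": allow_auto_delete},
        },
        # ASSUMPTION: "summary" and "details" are hypothetical key names.
        "healthIssue": {"summary": summary, "details": details},
    }


if __name__ == "__main__":
    body = build_online_repair_on(
        summary="Example: intermittent NIC link flaps",
        details="Example: link down/up events observed in host logs",
    )
    # Redirect this output to online-repair-on.json if the shape matches your API.
    print(json.dumps(body, indent=2))
```

Keeping the deletion flag as an explicit argument with a default of `False` makes an accidental opt-in harder.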
After the repair team confirms that the issue is fixed, create online-repair-off.json:
Run:
Expected result:
Ready.
Do not include healthIssue, policy, or acknowledgments when exiting online repair.
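As a rough illustration of the exit request, the body for online-repair-off.json carries only the disabled flag. The nesting under `onlineRepair` is inferred from the field paths used on this page, and the helper name `build_online_repair_off` is hypothetical.

```python
import json


def build_online_repair_off() -> dict:
    """Build a request body for exiting online repair (illustrative helper).

    Deliberately omits healthIssue, policy, and acknowledgments.
    """
    return {"onlineRepair": {"enabled": False}}


if __name__ == "__main__":
    # Redirect this output to online-repair-off.json if the shape matches your API.
    print(json.dumps(build_online_repair_off(), indent=2))
```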
If online repair failed and the machine now needs disruptive repair, this exit step is still required before releasing the instance. An instance cannot be released for full repair while it remains in online repair. Clear online repair, confirm the instance is back in Ready, and then follow Release Instance for Full Repair.
If online repair is being cleared because the repair failed, update the instance labels before releasing the instance for full repair. This leaves a visible breadcrumb for tenant and operator tooling after the instance leaves online repair.
The REST API operation is:
The OpenAPI operation ID is update-instance. Restish command shape:
First inspect the instance and preserve any existing labels. Instance label updates replace the full label map; labels not included in the update request are removed. Labels are limited to 10 key/value pairs, so use the minimum failure labels if the instance is already near that limit.
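Because the update replaces the whole label map, the request has to carry the existing labels as well as the new ones. The merge and the 10-pair limit can be sketched as follows. The function name `merge_failure_labels` and the label key in the usage example are hypothetical, not names defined by the API.

```python
MAX_LABELS = 10  # limit on key/value pairs stated on this page


def merge_failure_labels(existing: dict, failure: dict,
                         limit: int = MAX_LABELS) -> dict:
    """Return the full label map to send: existing labels plus failure labels."""
    merged = dict(existing)   # copy, so the caller's map is not mutated
    merged.update(failure)    # failure labels win on key collisions
    if len(merged) > limit:
        raise ValueError(
            f"{len(merged)} labels exceed the limit of {limit}; "
            "send fewer failure labels or retire an existing label first"
        )
    return merged


if __name__ == "__main__":
    current = {"team": "example"}               # labels read from the instance
    failure = {"online-repair": "failed"}       # hypothetical failure label
    print(merge_failure_labels(current, failure))
```

Raising an error instead of silently dropping labels keeps the choice of which label to give up with the operator.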
Create online-repair-failed-labels.json using the existing labels plus the failure labels:
Run:
Recommended labels:
After the label update succeeds, release the instance using Release Instance for Full Repair.
Use the deployment’s normal machine and instance inspection commands after each step. With Restish, the common pattern is:
Check that:
Repairing after enabling online repair.
Ready after disabling online repair.
The update request must contain only one kind of machine update. Do not combine onlineRepair with label updates, maintenance mode updates, instance type updates, or clearInstanceType.
Entering online repair requires:
onlineRepair.enabled: true
onlineRepair.policy.allowAutoInstanceDeletionOnFailure
true
healthIssue
Exiting online repair requires:
onlineRepair.enabled: false
No healthIssue
No onlineRepair.policy
No onlineRepair.acknowledgments
If online repair fails or requires disruptive work, clear online repair first and then release the instance for full repair using Release Instance for Full Repair.
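The request rules above can be checked locally before a request is sent. The following is a minimal sketch, not the server's validation logic. Only `onlineRepair`, `healthIssue`, and `clearInstanceType` are names taken from this page; the other top-level keys in `UPDATE_KINDS` are hypothetical placeholders for the label, maintenance mode, and instance type updates.

```python
# ASSUMPTION: "labels", "maintenanceMode", and "instanceTypeId" are
# hypothetical key names standing in for the other update kinds.
UPDATE_KINDS = ("onlineRepair", "labels", "maintenanceMode",
                "instanceTypeId", "clearInstanceType")


def validate_machine_update(body: dict) -> list:
    """Return a list of rule violations; an empty list means none were found."""
    errors = []
    kinds = [k for k in UPDATE_KINDS if k in body]
    if len(kinds) != 1:
        errors.append(f"expected exactly one kind of update, found {kinds}")

    repair = body.get("onlineRepair")
    if repair is None:
        return errors
    if not isinstance(repair, dict):
        return errors + ["onlineRepair must be an object"]

    enabled = repair.get("enabled")
    if enabled is True:
        policy = repair.get("policy")
        flag = policy.get("allowAutoInstanceDeletionOnFailure") if isinstance(policy, dict) else None
        if not isinstance(flag, bool):
            errors.append("entering requires onlineRepair.policy.allowAutoInstanceDeletionOnFailure")
        if "healthIssue" not in body:
            errors.append("entering requires healthIssue")
    elif enabled is False:
        if "healthIssue" in body:
            errors.append("exiting must not include healthIssue")
        for key in ("policy", "acknowledgments"):
            if key in repair:
                errors.append(f"exiting must not include onlineRepair.{key}")
    else:
        errors.append("onlineRepair.enabled must be true or false")
    return errors
```

For example, `validate_machine_update({"onlineRepair": {"enabled": False}})` returns an empty list, while a body that also carries `healthIssue` is reported as a violation.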