This page describes how NICo coordinates full machine repair after a tenant releases an instance with a reported machine issue. It is intended for provider operators, repair automation owners, and platform engineers who need to understand the health overrides, repair tenant behavior, and manual recovery paths behind the tenant-facing repair runbooks.
Tenant admins should start with one of these runbooks:
Platform admins and repair tenant operators should use Repair Tenant Workflow for the step-by-step targeted instance creation, repair outcome labeling, and repair tenant release procedure.
Full repair begins when a tenant releases an instance and includes a machineHealthIssue in the request body:
Restish example:
The release request removes the tenant assignment. NICo then uses the reported issue and site configuration to decide whether the machine should wait for manual intervention or be made available to repair automation.
The high-level flow is:
NICo uses health overrides to keep repair intent visible and to prevent unsafe allocation while a machine is under investigation.
Full repair and online repair are intentionally different:
Ready to Repairing.
onlineRepair.status: Failed, before releasing the instance. This preserves failure context for tenant and provider tooling during the escalation.
Auto-repair is controlled by the site API configuration:
When auto-repair is enabled, a tenant release with machineHealthIssue applies both tenant-reported-issue and repair-request.
When auto-repair is disabled, only tenant-reported-issue is applied. The machine remains unavailable for normal allocation until a provider operator clears the issue or manually triggers repair. If provider operations use a repair tenant and expect the repair tenant release to clear or reroute repair health overrides automatically, a repair-request override must be present before that release.
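
The override rules above can be summarized in a short sketch. It is an illustration only, not NICo's implementation: the function name `overrides_for_tenant_release` and its parameters are hypothetical, while the override names `tenant-reported-issue` and `repair-request` come from this page.

```python
def overrides_for_tenant_release(has_machine_health_issue: bool,
                                 auto_repair_enabled: bool) -> list[str]:
    """Return the health overrides expected after a tenant release.

    Hypothetical helper for illustration; it mirrors the rules described
    on this page and is not part of any NICo API.
    """
    if not has_machine_health_issue:
        # No reported issue: the full repair flow on this page does not apply.
        return []
    # A reported issue always records tenant-reported-issue.
    overrides = ["tenant-reported-issue"]
    if auto_repair_enabled:
        # Assumed to correspond to auto_machine_repair_plugin.enabled being true.
        overrides.append("repair-request")
    return overrides
```

With auto-repair disabled, the sketch returns only `tenant-reported-issue`, which matches the case where a provider operator must clear the issue or trigger repair manually.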
Repair tenants or repair automation use targeted provisioning to claim machines marked for repair. A repair tenant release is different from the original tenant’s release:
machineHealthIssue.isRepairTenant: true.
isRepairTenant: true requires the tenant to have the targeted instance creation capability.
repair_status: InProgress after claiming the machine, then set the final repair_status before releasing the repair instance. This prevents stale completion labels from older repair attempts.
repair_status: Completed with no new issue returns the machine toward the ready pool; failed, incomplete, missing, or unknown status keeps the machine blocked for repair-failed or manual handling.
Repair automation must report the repair result before releasing the repair instance. The expected repair status metadata is:
If the repair tenant releases the machine without a successful completion signal, NICo treats the repair as incomplete. This prevents a partially repaired or unverified machine from returning to normal tenant allocation.
The completion outcomes below apply when the machine has an active repair-request override at the time of the repair tenant release. Without repair-request, NICo does not clear tenant-reported-issue from a successful repair tenant release with no new issue.
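
As a rough sketch of how these conditions combine, the following function maps a repair tenant release to an outcome. The function name, parameter names, and returned labels are hypothetical and chosen for illustration; only the `Completed` status value and the role of the `repair-request` override come from this page.

```python
from typing import Optional


def repair_release_outcome(repair_status: Optional[str],
                           new_issue_reported: bool,
                           repair_request_present: bool) -> str:
    """Classify a repair tenant release (illustrative labels, not NICo states)."""
    if not repair_request_present:
        # Without repair-request, a successful release does not clear
        # tenant-reported-issue. Treating every status the same way here
        # is an assumption of this sketch.
        return "unchanged"
    if repair_status == "Completed" and not new_issue_reported:
        # Successful completion with no new issue: back toward the ready pool.
        return "ready"
    # Failed, incomplete, missing, or unknown status keeps the machine blocked.
    return "blocked"
```

For example, `repair_release_outcome(None, False, True)` yields `"blocked"`, reflecting that a release without a completion signal is treated as an incomplete repair.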
When repair is successful and the repair tenant reports completion:
Provider verification:
Expected state:
repair-request override.
tenant-reported-issue override for the repaired issue.
When repair fails or the repair tenant releases without a successful completion signal:
Provider verification:
Expected state:
tenant-reported-issue remains present.
Use admin tooling for provider-only recovery actions. The exact command syntax can vary by deployment and CLI version, but the common operations are:
Inspect machine state:
Manually request repair:
Clear a resolved tenant-reported issue:
Clear a stale repair request:
Escalate a machine that needs manual provider investigation:
Do not clear repair-related overrides unless the repair outcome is known. Clearing them early can return an unhealthy machine to tenant allocation.
If auto_machine_repair_plugin.enabled is false, a tenant release with machineHealthIssue records the issue but does not trigger repair automation.
Provider action:
If auto-repair is enabled but no repair tenant claims the machine:
repair-request is present.
Common causes include repair tenant capacity limits, missing allocation criteria, repair automation downtime, or site connectivity issues.
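
A monitoring job could flag this situation with a check along the following lines. This is a minimal sketch under stated assumptions: the function name, its inputs, and the four-hour default threshold are all hypothetical, and how a deployment actually obtains the override list or the request timestamp is not covered here.

```python
from datetime import datetime, timedelta, timezone


def repair_is_stalled(overrides: list[str],
                      claimed_by_repair_tenant: bool,
                      requested_at: datetime,
                      now: datetime,
                      max_wait: timedelta = timedelta(hours=4)) -> bool:
    """Return True when a repair request has waited too long unclaimed.

    Hypothetical helper; max_wait is an arbitrary example threshold.
    """
    if "repair-request" not in overrides:
        # Nothing is waiting for repair automation.
        return False
    if claimed_by_repair_tenant:
        # A repair tenant already holds the machine.
        return False
    return now - requested_at > max_wait


# Example: a request that has been pending for five hours.
requested = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
stalled = repair_is_stalled(["tenant-reported-issue", "repair-request"],
                            False, requested, requested + timedelta(hours=5))
```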
If the repair tenant reports success but validation or monitoring still shows the issue:
tenant-reported-issue.
OutForRepair style override if the machine should not re-enter automated repair immediately.
Avoid repeatedly sending the same machine through automated repair without new evidence. That can create loops where the repair system keeps claiming and releasing the same unhealthy machine.
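
One way to guard against such loops is a simple retry policy in the provider's own automation. The sketch below is an assumption-laden illustration: the function name, the attempt counter, and the `max_attempts` cap are hypothetical and are not NICo settings.

```python
def should_retry_automated_repair(previous_attempts: int,
                                  new_evidence: bool,
                                  max_attempts: int = 2) -> bool:
    """Decide whether to send a machine through automated repair again.

    Hypothetical policy: the first attempt is always allowed; later
    attempts need new evidence and must stay under an example cap.
    """
    if previous_attempts == 0:
        return True
    return new_evidence and previous_attempts < max_attempts
```

When the function returns `False`, the machine would stay with manual provider investigation rather than re-entering automated repair.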
The tenant-facing runbooks define when a tenant should use each repair path:
This page covers what provider systems do after the tenant has chosen the full repair path.
For the operational runbook used by a repair tenant, see Repair Tenant Workflow.