Instance and Fabric Issues
Use this playbook when an instance is stuck, capacity is unavailable, or GPU, InfiniBand, or NVLink state blocks assignment or release.
Start with Instance and Host State
Look for:
- assigned host
- host state and substate
- health alerts
- validation failures
- fabric or network config wait reasons
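The checklist above can be sketched as a small triage helper. This is only an illustration: `InstanceSnapshot` and its field names are hypothetical and are not part of any NICo API. They simply mirror the five items listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InstanceSnapshot:
    """Hypothetical record holding the fields from the checklist above."""
    assigned_host: Optional[str]
    host_state: str = ""
    host_substate: str = ""
    health_alerts: List[str] = field(default_factory=list)
    validation_failures: List[str] = field(default_factory=list)
    wait_reasons: List[str] = field(default_factory=list)


def triage(snapshot: InstanceSnapshot) -> List[str]:
    """Return findings in the same order as the checklist."""
    findings = []
    if snapshot.assigned_host is None:
        findings.append("no host assigned")
    else:
        findings.append(
            f"host {snapshot.assigned_host} is in "
            f"{snapshot.host_state}/{snapshot.host_substate}"
        )
    findings.extend(f"health alert: {a}" for a in snapshot.health_alerts)
    findings.extend(f"validation failure: {v}" for v in snapshot.validation_failures)
    findings.extend(f"waiting on: {w}" for w in snapshot.wait_reasons)
    return findings
```

Walking the fields in a fixed order keeps the output comparable between incidents, so a stuck instance with no assigned host is not confused with one that is waiting on fabric configuration.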
GPU Issues
NICo allocates whole bare-metal hosts, not individual GPUs. GPU-related issues usually appear as allocation blocks, validation failures, health alerts, or tenant-reported in-life failures.
Check:
Useful metrics:
- carbide_gpus_total_count
- carbide_gpus_usable_count
- carbide_gpus_in_use_count
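As a rough sketch, the three gauges `carbide_gpus_total_count`, `carbide_gpus_usable_count`, and `carbide_gpus_in_use_count` can be read together to see how many GPUs are out of service. The example assumes the metrics are scraped in the Prometheus text exposition format, that series can be summed across label sets, and that usable GPUs are a subset of the total. None of these assumptions is confirmed by this playbook, and the sample values are made up.

```python
GPU_GAUGES = (
    "carbide_gpus_total_count",
    "carbide_gpus_usable_count",
    "carbide_gpus_in_use_count",
)

# Made-up sample scrape, for illustration only.
SAMPLE = """\
# TYPE carbide_gpus_total_count gauge
carbide_gpus_total_count 64
carbide_gpus_usable_count 56
carbide_gpus_in_use_count 48
"""


def parse_gauges(exposition_text, names=GPU_GAUGES):
    """Sum each named gauge across all of its label sets.

    Simplification: assumes label values contain no spaces, so the sample
    value is always the second whitespace-separated token on the line.
    """
    totals = {name: 0.0 for name in names}
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2:
            continue
        metric = parts[0].split("{", 1)[0]  # drop any {label="..."} suffix
        if metric in totals:
            totals[metric] += float(parts[1])
    return totals


def summarize(totals):
    """Derive the unusable count, assuming usable is a subset of total."""
    total = totals["carbide_gpus_total_count"]
    usable = totals["carbide_gpus_usable_count"]
    in_use = totals["carbide_gpus_in_use_count"]
    return {
        "total": total,
        "usable": usable,
        "in_use": in_use,
        "unusable": total - usable,
    }
```

A growing `unusable` value alongside allocation blocks would point at health or validation problems rather than plain capacity exhaustion.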
Common causes:
InfiniBand Issues
InfiniBand problems usually show up as partition programming failures, UFM sync drift, missing P_Keys, unexpected P_Keys, or cleanup pending alerts.
Check:
Look for:
- IbPortDown
- IbCleanupPending
- missing P_Keys
- unexpected P_Keys
- UFM API errors
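Missing and unexpected P_Keys amount to a set difference between what the partition should have and what the fabric reports. The sketch below assumes both lists have already been collected (for example from the controller and from UFM). The function names are hypothetical. It masks off the high-order membership bit of each 16-bit P_Key so that full and limited membership of the same partition compare as equal; drop the mask if membership type matters for your check.

```python
def normalize_pkey(pkey):
    """Convert a P_Key given as a hex string or an int to its 15-bit base value."""
    value = int(pkey, 16) if isinstance(pkey, str) else int(pkey)
    return value & 0x7FFF  # clear the membership bit (bit 15)


def diff_pkeys(expected, observed):
    """Return (missing, unexpected) P_Keys as sorted lists of ints."""
    expected_set = {normalize_pkey(k) for k in expected}
    observed_set = {normalize_pkey(k) for k in observed}
    missing = sorted(expected_set - observed_set)
    unexpected = sorted(observed_set - expected_set)
    return missing, unexpected
```

For instance, `diff_pkeys(["0x8001", "0x8002"], ["0x0001", "0x8003"])` reports partition 2 as missing and partition 3 as unexpected, while partition 1 matches despite the differing membership bit.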
NVLink Partition Issues
NVLink issues usually appear during placement, attach, detach, cleanup, or domain health checks.
Check:
Common causes:
Release and Cleanup
On termination, cleanup may still need to detach network, InfiniBand, NVLink, or other fabric state before the host can return to the pool.
Do not force-delete as a first step. Confirm which cleanup step is blocked, then fix the fabric or controller dependency that owns that step.
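One way to make the "confirm before forcing" rule concrete is to list the cleanup steps that have not completed and refuse to release the host while any remain. The step names, the `"done"` status value, and the shape of the status mapping below are all hypothetical; map them onto whatever the controller actually reports.

```python
# Hypothetical step names, taken from the fabric types mentioned above.
CLEANUP_STEPS = ("network", "infiniband", "nvlink")


def blocked_steps(step_status):
    """Return the cleanup steps that are not yet done, in a fixed order.

    A step missing from the mapping is treated as still pending.
    """
    return [
        step for step in CLEANUP_STEPS
        if step_status.get(step, "pending") != "done"
    ]


def safe_to_release(step_status):
    """The host may return to the pool only when no cleanup step is blocked."""
    return not blocked_steps(step_status)
```

Treating an unknown step as pending errs on the side of keeping the host out of the pool until its fabric state is confirmed clean.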