DPU Provisioning Troubleshooting
This troubleshooting guide is intended for administrators who maintain and operate DPU provisioning.
Symptoms
If a DPU is provisioned with a faulty provisioning script, the BareMetalHost (BMH) enters the “provisioning error” state. When the DPU is later deleted, the de-provisioning phase may get stuck while deleting the BMH.
Resolution
Apply the following patch to skip the de-provisioning phase, then delete the DPU:
kubectl patch bmh <baremetalhost-cr-name> -n universe -p '{"spec":{"automatedCleaningMode":"disabled"}}' --type="merge"
kubectl delete dpu <dpu-cr-name> -n universe
Note that skipping the de-provisioning phase leaves the OS, containers, and daemons in place on the DPU.
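The two commands above can be wrapped in a small shell function so that they always run in the same order. This is only a sketch: the function name skip_deprovision_and_delete and the KUBECTL override variable are hypothetical helpers introduced here for illustration, and the default namespace universe is taken from the commands in this guide.

```shell
#!/usr/bin/env bash
# Sketch only: skip de-provisioning for a stuck BMH, then delete the DPU CR.
# KUBECTL is a hypothetical override; set KUBECTL=echo to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

skip_deprovision_and_delete() {
  local bmh="$1" dpu="$2" ns="${3:-universe}"
  if [ -z "$bmh" ] || [ -z "$dpu" ]; then
    echo "usage: skip_deprovision_and_delete <bmh-name> <dpu-name> [namespace]" >&2
    return 2
  fi
  # Disable automated cleaning so that deletion does not wait on de-provisioning.
  "$KUBECTL" patch bmh "$bmh" -n "$ns" --type=merge \
    -p '{"spec":{"automatedCleaningMode":"disabled"}}' || return 1
  # Delete the DPU CR only after the patch has been accepted.
  "$KUBECTL" delete dpu "$dpu" -n "$ns"
}
```

Running it with KUBECTL=echo first prints the commands without touching the cluster, which may be a useful check before applying the patch for real.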
Symptoms
During provisioning, a SettingProviderIDOnNodeFailed error message appears when clusterctl describe is run on the DCM node.
root@dcm05:/opt/regression# clusterctl describe cluster dcm-cluster -n universe
NAME                                                           READY  SEVERITY  REASON                         SINCE  MESSAGE
Cluster/dcm-cluster                                            True                                            55m
ClusterInfrastructure - Metal3Cluster/dcm-cluster              True                                            55m
ControlPlane - UniverseControlPlane/dcm-cluster-control-plane  True                                            55m
Workers
Other
Machine/hpc-cloud05-bf1                                        False  Error     SettingProviderIDOnNodeFailed  10m    1 of 2 completed
Cause
This is normal behavior in CAPI. The message is removed automatically when cloud-init is re-registered.
Symptoms
During provisioning, an AssociatedBMHFailure error message appears when clusterctl describe is run on the DCM node.
Cause
This is normal behavior in CAPI. The message is removed automatically when the BareMetalHost’s status becomes ready.
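Because these messages are expected to clear on their own, it can help to filter the clusterctl describe output down to the rows that are still not ready and re-run the check until nothing is printed. The following is a minimal sketch: the function name not_ready_rows is hypothetical, and it assumes that the word False appears in a row only as the value of the READY column.

```shell
# Sketch only: print the rows of `clusterctl describe` output that contain
# a READY value of False. Reads the describe output from standard input.
not_ready_rows() {
  awk 'NR > 1 {
    # Scan every field, because names such as
    # "ControlPlane - UniverseControlPlane/..." span several fields.
    for (i = 1; i <= NF; i++) {
      if ($i == "False") { print; break }
    }
  }'
}

# Example usage (requires clusterctl and access to the cluster):
#   clusterctl describe cluster dcm-cluster -n universe | not_ready_rows
```

An empty result suggests that the transient condition has cleared. A row that keeps appearing long after provisioning finishes may point to a different problem.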
Symptoms
After the provisioning components are uninstalled with the helm uninstall command, the DPU, BareMetalHost, and Metal3Machine CRs are left in the cluster.
Cause
These CRs are created by the cloud admin or by controllers. In this release, the helm uninstall command does not delete CRs that were not created by helm install.
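To see which CRs remain after an uninstall, the three resource types can be listed in one call. This is a sketch under assumptions: the function name list_leftover_crs and the KUBECTL override are hypothetical, the namespace defaults to universe as used elsewhere in this guide, and the resource names assume that the CRDs are registered as dpu, baremetalhost, and metal3machine.

```shell
# Sketch only: list the CRs that helm uninstall leaves behind.
# KUBECTL is a hypothetical override; set KUBECTL=echo to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

list_leftover_crs() {
  local ns="${1:-universe}"
  # A comma-separated list queries several resource types in one request.
  "$KUBECTL" get dpu,baremetalhost,metal3machine -n "$ns"
}
```

Whether the listed CRs should then be removed by hand is a decision for the administrator; this guide does not prescribe a cleanup procedure for them.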