DPU Provisioning Troubleshooting

This troubleshooting guide is intended for people who are responsible for maintaining and administering DPU provisioning.

Symptoms

When you try to provision a DPU with a faulty provisioning script, the state of the BMH (BareMetalHost) is set to “provisioning error”. The de-provisioning phase may then get stuck while deleting the BMH.

Resolution

Apply the following patch to skip the de-provisioning phase, then delete the DPU:

kubectl patch bmh <baremetalhost-cr-name> -n <universe> -p '{"spec":{"automatedCleaningMode":"disabled"}}' --type="merge"
kubectl delete dpu <dpu-cr-name> -n universe

Note

Skipping the de-provisioning phase leaves the OS, containers, and daemons lingering on the DPU.
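
The patch and delete commands from this resolution can be wrapped in a small helper so they always run in the same order. The following is a minimal sketch, not part of the product: the function name skip_deprovision_and_delete and the DRY_RUN variable are hypothetical, and the namespace is assumed to default to universe. With DRY_RUN=1 (the default here) the commands are only printed, so they can be reviewed before anything changes on the cluster.

```shell
#!/usr/bin/env bash
# Sketch: disable automated cleaning on a BMH, then delete the matching DPU CR.
# Assumptions (not from the guide): the helper name, the DRY_RUN switch,
# and "universe" as the default namespace.

skip_deprovision_and_delete() {
  local bmh="${1:-}" dpu="${2:-}" ns="${3:-universe}"
  local patch='{"spec":{"automatedCleaningMode":"disabled"}}'
  local run=""

  if [ -z "$bmh" ] || [ -z "$dpu" ]; then
    echo "usage: skip_deprovision_and_delete <bmh-name> <dpu-name> [namespace]" >&2
    return 2
  fi

  # DRY_RUN=1 prints the commands instead of executing them.
  if [ "${DRY_RUN:-1}" = "1" ]; then run="echo"; fi

  # Delete the DPU only if the patch was applied successfully.
  $run kubectl patch bmh "$bmh" -n "$ns" -p "$patch" --type=merge &&
    $run kubectl delete dpu "$dpu" -n "$ns"
}
```

To execute the commands for real, call the helper with DRY_RUN=0, for example: DRY_RUN=0 skip_deprovision_and_delete <baremetalhost-cr-name> <dpu-cr-name>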

Symptoms

During provisioning, a SettingProviderIDOnNodeFailed error message appears when you run clusterctl describe on the DCM node.

root@dcm05:/opt/regression# clusterctl describe cluster dcm-cluster -n universe
NAME                                                             READY  SEVERITY  REASON                         SINCE  MESSAGE
Cluster/dcm-cluster                                              True                                            55m
  ClusterInfrastructure - Metal3Cluster/dcm-cluster              True                                            55m
  ControlPlane - UniverseControlPlane/dcm-cluster-control-plane  True                                            55m
  Workers
    Other
      Machine/hpc-cloud05-bf1                                    False  Error     SettingProviderIDOnNodeFailed  10m    1 of 2 completed

Cause

This is normal behavior in CAPI (Cluster API). The message is removed automatically once cloud-init is re-registered.
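
To see at a glance which objects are still reporting a problem, the clusterctl describe output can be filtered for rows whose READY column is False. This is a minimal sketch under stated assumptions: the helper name not_ready_rows is hypothetical, and it assumes the column layout NAME, READY, SEVERITY, REASON, SINCE, MESSAGE with SEVERITY and REASON both populated on not-ready rows.

```shell
#!/usr/bin/env bash
# Sketch: print "<name> <reason>" for every row whose READY column is False.
# Assumption: columns are NAME READY SEVERITY REASON SINCE MESSAGE, and
# SEVERITY and REASON are both filled in when READY is False.

not_ready_rows() {
  awk '{
    for (i = 2; i <= NF; i++) {
      if ($i == "False") {
        # $(i-1) is the last token of the name, $(i+2) is the REASON column.
        print $(i - 1), $(i + 2)
        break
      }
    }
  }'
}
```

Example usage: clusterctl describe cluster dcm-cluster -n universe | not_ready_rows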

Symptoms

During provisioning, an AssociatedBMHFailure error message appears when you run clusterctl describe on the DCM node.

Cause

This is normal behavior in CAPI. The message is removed automatically once the BareMetalHost’s status becomes ready.
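
Because the message clears on its own, it can be convenient to poll until it is gone instead of re-running the command by hand. The sketch below is illustrative only: the helper name wait_until_cleared and its argument order (reason, attempts, interval in seconds, then the command to run) are assumptions, not part of any tool mentioned in this guide.

```shell
#!/usr/bin/env bash
# Sketch: re-run a command until its output no longer contains a given reason.
# Assumptions: the helper name and its arguments are hypothetical.
# Returns 0 when the reason is gone, 1 if it is still present after all
# attempts, and 2 if the command itself fails.

wait_until_cleared() {
  local reason="$1" attempts="$2" interval="$3"
  shift 3
  local n=0 out

  while [ "$n" -lt "$attempts" ]; do
    # A failing command must not be mistaken for a cleared message.
    out="$("$@" 2>/dev/null)" || { echo "command failed: $*" >&2; return 2; }
    if ! printf '%s\n' "$out" | grep -qF -- "$reason"; then
      return 0
    fi
    n=$((n + 1))
    sleep "$interval"
  done
  return 1
}
```

Example usage, checking every 30 seconds for up to 20 attempts: wait_until_cleared AssociatedBMHFailure 20 30 clusterctl describe cluster dcm-cluster -n universe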

Symptoms

After you uninstall the provisioning components with the helm uninstall command, the DPU, BareMetalHost, and Metal3Machine CRs are left in the cluster.

Cause

These CRs are created by the cloud admin or by controllers. In this release, the helm uninstall command does not delete CRs that were not created by helm install.
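
To find out which of these CRs remain after an uninstall, they can be listed in a single kubectl call. This is a minimal sketch: the helper name list_leftover_crs and the DRY_RUN switch are hypothetical, the resource names dpu, baremetalhost, and metal3machine are assumed to match the installed CRDs, and the namespace is assumed to default to universe. The helper only lists resources; it deletes nothing.

```shell
#!/usr/bin/env bash
# Sketch: list DPU, BareMetalHost, and Metal3Machine CRs left in a namespace.
# Assumptions: helper name, DRY_RUN switch, resource names, and the
# "universe" default namespace are illustrative, not taken from the product.

list_leftover_crs() {
  local ns="${1:-universe}"
  local run=""

  # DRY_RUN=1 (default) prints the command instead of contacting the cluster.
  if [ "${DRY_RUN:-1}" = "1" ]; then run="echo"; fi

  $run kubectl get dpu,baremetalhost,metal3machine -n "$ns"
}
```

Run it with DRY_RUN=0 to query the cluster, for example: DRY_RUN=0 list_leftover_crs universe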

© Copyright 2023, NVIDIA. Last updated on Feb 7, 2024.