K8s Cluster Scale-out
At this point, add the worker nodes to the cluster. As workers join the cluster, their DPUs are provisioned and the DPUServices begin to spin up.
Return to the shell where Kubespray was previously run to deploy the cluster, uncomment the kube_node group in the hosts.yaml file, and add the worker nodes to the cluster:
Note: Ensure you are in the Python virtual environment (.venv) when running the command.
Jump Node Console
(.venv) depuser@jump:~/kubespray$ cat inventory/mycluster/hosts.yaml
...
k8s_cluster:
  children:
    kube_control_plane:
    kube_node:
...
(.venv) depuser@jump:~/kubespray$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root scale.yml
The scale-out should not take long. A successful run ends with an Ansible PLAY RECAP in which every host reports unreachable=0 and failed=0.
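One way to check the result of the playbook without reading the whole log is to scan its recap lines for non-zero failure counters. The following is a minimal sketch, not part of the original procedure: the function name recap_ok and the log file name scale.log are hypothetical, and it assumes the standard Ansible recap format with unreachable=N and failed=N fields.

```shell
# recap_ok (hypothetical helper): reads Ansible output on stdin and
# returns 0 only if no line reports a non-zero failed= or unreachable= count.
recap_ok() {
  awk '
    BEGIN { bad = 0 }
    /failed=[1-9]/ || /unreachable=[1-9]/ { bad = 1 }
    END { exit bad }
  '
}
```

For example, after saving the playbook output with `ansible-playbook ... scale.yml | tee scale.log`, running `recap_ok < scale.log && echo "scale-out OK"` prints the message only when no host failed. The exit status of ansible-playbook itself carries the same information for a live run.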
To follow the progress of DPU provisioning, run the following command to see which phase each DPU is currently in:
Jump Node Console
$ watch -n10 "kubectl describe dpu -n dpf-operator-system | grep 'Node Name\|Type\|Last\|Phase'"
Every 10.0s: kubectl describe dpu -n dpf-operator-system | grep 'Node Name\|Type\|Last\|Phase'    jump: Tue May 20 14:54:41 2025

  Dpu Node Name:  worker1
    Last Transition Time:  2025-05-20T14:51:54Z
    Type:                  Initialized
    Last Transition Time:  2025-05-20T14:51:54Z
    Type:                  BFBReady
    Last Transition Time:  2025-05-20T14:52:09Z
    Type:                  NodeEffectReady
    Last Transition Time:  2025-05-20T14:52:10Z
    Type:                  InterfaceInitialized
    Last Transition Time:  2025-05-20T14:52:11Z
    Type:                  FWConfigured
  Phase:                   OS Installing
  Dpu Node Name:  worker2
    Last Transition Time:  2025-05-20T14:50:34Z
    Type:                  Initialized
    Last Transition Time:  2025-05-20T14:50:34Z
    Type:                  BFBReady
    Last Transition Time:  2025-05-20T14:50:49Z
    Type:                  NodeEffectReady
    Last Transition Time:  2025-05-20T14:50:50Z
    Type:                  InterfaceInitialized
    Last Transition Time:  2025-05-20T14:50:51Z
    Type:                  FWConfigured
  Phase:                   OS Installing
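When only the current phase matters, the describe output can be condensed to one line per DPU. The following is a small sketch under the assumption that the output contains "Dpu Node Name:" and "Phase:" lines as shown above; the function name dpu_phases is hypothetical.

```shell
# dpu_phases (hypothetical helper): reads `kubectl describe dpu` output on
# stdin and prints "<node>: <phase>" for each DPU.
dpu_phases() {
  awk -F': *' '
    /Dpu Node Name:/ { node = $2 }          # remember the most recent node name
    /Phase:/         { print node ": " $2 } # emit it together with its phase
  '
}
```

A possible invocation is `kubectl describe dpu -n dpf-operator-system | dpu_phases`, which can also be wrapped in watch in the same way as the command above.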
Validate that the DPUs have been provisioned successfully by ensuring they are in the Ready state:
Jump Node Console
$ kubectl wait --for=condition=ready --namespace dpf-operator-system dpu --all
dpu.provisioning.dpu.nvidia.com/worker1-0000-89-00 condition met
dpu.provisioning.dpu.nvidia.com/worker2-0000-89-00 condition met
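kubectl wait gives up after its default timeout of 30 seconds, which may be shorter than the time DPU provisioning needs. One option is to pass a longer --timeout; another is to retry the command a bounded number of times. The following is a generic sketch of the second approach. The function name retry and the attempt and delay values in the usage note are illustrative assumptions, not values from this guide.

```shell
# retry (hypothetical helper): run a command up to <attempts> times,
# sleeping <delay> seconds between tries; returns 0 on the first success.
# Usage: retry <attempts> <delay> <command> [args...]
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    retry_i=$((retry_i + 1))
    # Sleep only if another attempt will follow.
    if [ "$retry_i" -le "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
  done
  return 1
}
```

For example, `retry 30 10 kubectl wait --for=condition=ready --namespace dpf-operator-system dpu --all --timeout=60s` would keep checking for roughly half an hour before giving up.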
Ensure that the following DaemonSets have 2 ready replicas:
Jump Node Console
$ kubectl wait ds --for=jsonpath='{.status.numberReady}'=2 --namespace nvidia-network-operator kube-multus-ds sriov-network-config-daemon sriov-device-plugin
daemonset.apps/kube-multus-ds condition met
daemonset.apps/sriov-network-config-daemon condition met
daemonset.apps/sriov-device-plugin condition met
$ kubectl wait ds --for=jsonpath='{.status.numberReady}'=2 --namespace ovn-kubernetes ovn-kubernetes-node-dpu-host
daemonset.apps/ovn-kubernetes-node-dpu-host condition met
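The value 2 in the commands above matches the two workers in this setup. For clusters with a different number of workers, a possible alternative is to compare each DaemonSet's ready count against its own desired count. The following is a sketch only: the function name ds_not_ready is hypothetical, and it expects input lines of the form "NAME DESIRED READY". It does not catch the case where a DaemonSet has zero desired pods because no worker has joined yet.

```shell
# ds_not_ready (hypothetical helper): reads "NAME DESIRED READY" lines on
# stdin, prints the names whose READY differs from DESIRED, and returns
# non-zero if any such DaemonSet was found.
ds_not_ready() {
  awk '
    BEGIN { bad = 0 }
    $2 != $3 { print $1; bad = 1 }
    END { exit bad }
  '
}
```

Suitable input can be produced with `kubectl get ds --namespace nvidia-network-operator --no-headers -o custom-columns=NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,READY:.status.numberReady | ds_not_ready`.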
Validate that all DPUService, DPUServiceIPAM, DPUServiceInterface, and DPUServiceChain objects are now in the ready state:
Jump Node Console
$ kubectl wait --for=condition=ApplicationsReady --namespace dpf-operator-system dpuservices -l svc.dpu.nvidia.com/owned-by-dpudeployment=dpf-operator-system_ovn-hbn
dpuservice.svc.dpu.nvidia.com/blueman-kqm2q condition met
dpuservice.svc.dpu.nvidia.com/dts-b8vfs condition met
dpuservice.svc.dpu.nvidia.com/hbn-2rglk condition met
dpuservice.svc.dpu.nvidia.com/ovn-5tr2j condition met
$ kubectl wait --for=condition=DPUIPAMObjectReady --namespace dpf-operator-system dpuserviceipam --all
dpuserviceipam.svc.dpu.nvidia.com/loopback condition met
dpuserviceipam.svc.dpu.nvidia.com/pool1 condition met
$ kubectl wait --for=condition=ServiceInterfaceSetReady --namespace dpf-operator-system dpuserviceinterface --all
dpuserviceinterface.svc.dpu.nvidia.com/hbn-p0-if-tnkf8 condition met
dpuserviceinterface.svc.dpu.nvidia.com/hbn-p1-if-ww8qv condition met
dpuserviceinterface.svc.dpu.nvidia.com/hbn-pf2dpu2-if-7l5mk condition met
dpuserviceinterface.svc.dpu.nvidia.com/ovn condition met
dpuserviceinterface.svc.dpu.nvidia.com/p0 condition met
dpuserviceinterface.svc.dpu.nvidia.com/p1 condition met
$ kubectl wait --for=condition=ServiceChainSetReady --namespace dpf-operator-system dpuservicechain --all
dpuservicechain.svc.dpu.nvidia.com/ovn-hbn-6lkvj condition met
Congratulations, the DPF system has been successfully installed!