Date: 2025-12-23

The following variables are required by this guide. A sensible default is provided where it makes sense, but many will be specific to the target infrastructure.

Commands in this guide are run in the same directory that contains this readme.

Environment variables file

```shell
## Virtual IP used by the load balancer for the DPU Cluster. Must be a reserved IP from the management subnet and not allocated by DHCP.
export DPUCLUSTER_VIP=

## Interface on which the DPUCluster load balancer will listen. Should be the management interface of the control plane node.
export DPUCLUSTER_INTERFACE=

## IP address of the NFS server used for storing the BFB image.
## NOTE: This environment variable does NOT control the address of the NFS server used as a remote target by SNAP VirtioFS.
export NFS_SERVER_IP=

## The repository URL for the NVIDIA Helm chart registry.
## Usually this is the NVIDIA Helm NGC registry. For development purposes, this can be set to a different repository.
export HELM_REGISTRY_REPO_URL=https:

## The repository URL for the HBN container image.
## Usually this is the NVIDIA NGC registry. For development purposes, this can be set to a different repository.
export HBN_NGC_IMAGE_URL=nvcr.io/nvidia/doca/doca_hbn

## The repository URL for the SNAP VFS container image.
## Usually this is the NVIDIA NGC registry. For development purposes, this can be set to a different repository.
export SNAP_NGC_IMAGE_URL=nvcr.io/nvidia/doca/doca_vfs

## The DPF REGISTRY is the Helm repository URL where the DPF Operator Chart resides.
## Usually this is the NVIDIA Helm NGC registry. For development purposes, this can be set to a different repository.
export REGISTRY=https:

## The DPF TAG is the version of the DPF components which will be deployed in this guide.
export TAG=v25.10.0

## URL to the BFB used in the `bfb.yaml` and linked by the DPUSet.
export BFB_URL="https://content.mellanox.com/BlueField/BFBs/Ubuntu24.04/bf-bundle-3.2.1-34_25.11_ubuntu-24.04_64k_prod.bfb"
```

Modify the variables in manifests/00-env-vars/envvars.env to fit your environment, then source the file:

```shell
source manifests/00-env-vars/envvars.env
```
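Several variables in the file ship without a default, so it can help to confirm that everything is populated before continuing. The following is a minimal sketch, not part of the guide's manifests: `check_required_vars` is a hypothetical helper name, and the variable names passed to it are the ones exported by `envvars.env` above.

```shell
# Hypothetical helper: print the name of every listed variable that is
# unset or empty. Returns 0 when all are set, 1 otherwise.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Example call, to be run after sourcing envvars.env:
# check_required_vars DPUCLUSTER_VIP DPUCLUSTER_INTERFACE NFS_SERVER_IP \
#   HELM_REGISTRY_REPO_URL HBN_NGC_IMAGE_URL SNAP_NGC_IMAGE_URL REGISTRY TAG BFB_URL
```

A non-zero return value means at least one variable still needs a value in `manifests/00-env-vars/envvars.env`.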





A number of environment variables must be set before running this command.

```shell
kubectl create ns dpf-operator-system
cat manifests/01-dpf-operator-installation/*.yaml | envsubst | kubectl apply -f -
```

This deploys the following objects:

PersistentVolume and PersistentVolumeClaim for the provisioning controller

```yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: bfb-pv
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  nfs:
    path: /mnt/dpf_share/bfb
    server: $NFS_SERVER_IP
  persistentVolumeReclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bfb-pvc
  namespace: dpf-operator-system
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeMode: Filesystem
  storageClassName: ""
```

Local Path Provisioner Helm values

```yaml
tolerations:
  - operator: Exists
    effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
  - operator: Exists
    effect: NoSchedule
    key: node-role.kubernetes.io/master
```
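The manifests in this guide are piped through `envsubst`, which replaces a reference to an unset variable with an empty string rather than failing. As a rough pre-flight check, the sketch below lists the `$VAR` and `${VAR}` references in a manifest whose variables are currently unset or empty. `list_unset_refs` is a hypothetical helper name, and the regular expression only approximates the references `envsubst` recognizes.

```shell
# Hypothetical helper: read manifest text on stdin and print the name of
# each referenced $VAR or ${VAR} that is unset or empty in this shell.
list_unset_refs() {
  grep -oE '\$\{?[A-Za-z_][A-Za-z0-9_]*\}?' |
    tr -d '${}' | sort -u |
    while read -r name; do
      # Look up the value of the variable whose name is stored in $name.
      eval "value=\${$name:-}"
      [ -n "$value" ] || echo "$name"
    done
}

# Example call against the manifests applied in this step:
# cat manifests/01-dpf-operator-installation/*.yaml | list_unset_refs
```

Empty output suggests every referenced variable has a value. Any name printed would be substituted as an empty string by `envsubst`.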





Before deploying the DPF Operator, ensure that Helm is properly configured according to the Helm prerequisites.

Warning: This is a critical prerequisite step that must be completed for the DPF Operator to function properly.





A number of environment variables must be set before running this command.

If $REGISTRY is an HTTP registry (the default), use these commands:

```shell
helm repo add --force-update dpf-repository ${REGISTRY}
helm repo update
helm upgrade --install -n dpf-operator-system dpf-operator dpf-repository/dpf-operator --version=$TAG
```

For development purposes, if $REGISTRY is an OCI registry, use this command instead:

```shell
helm upgrade --install -n dpf-operator-system dpf-operator $REGISTRY/dpf-operator --version=$TAG
```
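Which of the two installation variants applies depends only on the scheme of `$REGISTRY`. A small sketch of that decision follows. `registry_kind` is a hypothetical helper name, and it assumes OCI registries are written with the `oci://` prefix that Helm uses for OCI chart references.

```shell
# Hypothetical helper: classify a registry URL by its scheme.
# Prints "oci" for oci:// URLs and "http" for http:// or https:// URLs.
registry_kind() {
  case "$1" in
    oci://*) echo oci ;;
    http://*|https://*) echo http ;;
    *) echo unknown; return 1 ;;
  esac
}

# Example call:
# registry_kind "$REGISTRY"
```

If the helper prints `http`, the `helm repo add` variant applies. If it prints `oci`, the single `helm upgrade --install` command with `$REGISTRY/dpf-operator` applies.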

These verification commands may need to be run multiple times to ensure the condition is met.

Verify the DPF Operator installation with:

```shell
## Ensure the DPF Operator deployment is available.
kubectl rollout status deployment --namespace dpf-operator-system dpf-operator-controller-manager
## Ensure all pods in the DPF Operator system are ready.
kubectl wait --for=condition=ready --namespace dpf-operator-system pods --all
```
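Because the verification commands may need several attempts, they can be wrapped in a small retry loop instead of being rerun by hand. This is a minimal sketch: `retry` is a hypothetical helper, and the attempt count and delay in the example call are arbitrary values, not recommendations from this guide.

```shell
# Hypothetical helper: run a command until it succeeds.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    # Give up once the attempt budget is used.
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example call:
# retry 30 10 kubectl wait --for=condition=ready --namespace dpf-operator-system pods --all
```

The same wrapper can be reused for the other verification steps in this guide.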

This section involves creating the DPF system components and some basic infrastructure required for a functioning DPF-enabled cluster.

A number of environment variables must be set before running this command.

```shell
kubectl create ns dpu-cplane-tenant1
cat manifests/02-dpf-system-installation/*.yaml | envsubst | kubectl apply -f -
```

This will create the following objects:

DPFOperatorConfig to install the DPF System components

```yaml
---
apiVersion: operator.dpu.nvidia.com/v1alpha1
kind: DPFOperatorConfig
metadata:
  name: dpfoperatorconfig
  namespace: dpf-operator-system
spec:
  provisioningController:
    bfbPVCName: "bfb-pvc"
    dmsTimeout: 900
  kamajiClusterManager:
    disable: false
```

DPUCluster to serve as Kubernetes control plane for DPU nodes

```yaml
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUCluster
metadata:
  name: dpu-cplane-tenant1
  namespace: dpu-cplane-tenant1
spec:
  type: kamaji
  maxNodes: 10
  clusterEndpoint:
    # deploy keepalived instances on the nodes that match the given nodeSelector.
    keepalived:
      # interface on which keepalived will listen. Should be the oob interface of the control plane node.
      interface: $DPUCLUSTER_INTERFACE
      # Virtual IP reserved for the DPU Cluster load balancer. Must not be allocatable by DHCP.
      vip: $DPUCLUSTER_VIP
      # virtualRouterID must be in range [1,255], make sure the given virtualRouterID does not duplicate with any existing keepalived process running on the host
      virtualRouterID: 126
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
```
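If `virtualRouterID` is changed from the value shown, the range constraint in the manifest comment can be checked before applying. The sketch below covers only the `[1,255]` range. `valid_vrid` is a hypothetical helper name, and it does not detect clashes with other keepalived instances on the host.

```shell
# Hypothetical helper: succeed only if the argument is an integer in [1,255].
valid_vrid() {
  case "$1" in
    # Reject empty input and anything containing a non-digit.
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 255 ]
}

# Example call:
# valid_vrid 126 && echo "virtualRouterID is in range"
```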





These verification commands may need to be run multiple times to ensure the condition is met.

Verify the DPF System with:

```shell
## Ensure the provisioning and DPUService controller manager deployments are available.
kubectl rollout status deployment --namespace dpf-operator-system dpf-provisioning-controller-manager dpuservice-controller-manager
## Ensure all other deployments in the DPF Operator system are Available.
kubectl rollout status deployment --namespace dpf-operator-system
## Ensure the DPUCluster is ready for nodes to join.
kubectl wait --for=condition=ready --namespace dpu-cplane-tenant1 dpucluster --all
```

```shell
helm repo add nvidia https:
helm upgrade --no-hooks --install --create-namespace --namespace nvidia-network-operator network-operator nvidia/network-operator --version 25.7.0 -f ./manifests/03-enable-accelerated-interfaces/helm-values/network-operator.yml
```

NVIDIA Network Operator Helm values

```yaml
nfd:
  enabled: false
  deployNodeFeatureRules: false
sriovNetworkOperator:
  enabled: true
sriov-network-operator:
  operator:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-role.kubernetes.io/master
                  operator: Exists
            - matchExpressions:
                - key: node-role.kubernetes.io/control-plane
                  operator: Exists
  crds:
    enabled: true
  sriovOperatorConfig:
    deploy: true
    configDaemonNodeSelector: null
operator:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
```





```shell
cat manifests/03-enable-accelerated-interfaces/*.yaml | envsubst | kubectl apply -f -
```

This will deploy the following objects:

NICClusterPolicy for the NVIDIA Network Operator

```yaml
---
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  secondaryNetwork:
    multus:
      image: multus-cni
      imagePullSecrets: []
      repository: ghcr.io/k8snetworkplumbingwg
      version: v3.9.3
```





These verification commands may need to be run multiple times to ensure the condition is met.

Verify the NVIDIA Network Operator deployment with:

```shell
## Ensure all pods in the nvidia-network-operator namespace are ready.
kubectl wait --for=condition=Ready --namespace nvidia-network-operator pods --all
## Expect the following Daemonsets to be successfully rolled out.
kubectl rollout status daemonset --namespace nvidia-network-operator kube-multus-ds
```

In this section, you'll provision your DPUs and deploy the required services. You'll need to create a DPUDeployment object that defines which DPUServices should be installed on each selected DPU. This provides a flexible way to specify and manage the services that run on your DPUs.

To learn more about DPUDeployments, see the DPUDeployment documentation.

This guide includes examples for both SNAP Block (NVMe) and SNAP VirtioFS Storage. Please refer to the relevant sections below and follow the instructions to deploy the desired storage type.
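Each storage type has its own manifest directory, and only the directory for the chosen type should be applied. The sketch below maps a storage type to its directory, using the paths from the commands in this guide. `manifest_dir` and the type keywords `nvme` and `virtiofs` are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: print the manifest directory for a storage type.
manifest_dir() {
  case "$1" in
    nvme) echo "manifests/04.1-dpudeployment-installation-nvme" ;;
    virtiofs) echo "manifests/04.2-dpudeployment-installation-virtiofs" ;;
    *) echo "unknown storage type: $1" >&2; return 1 ;;
  esac
}

# Example call:
# cat "$(manifest_dir nvme)"/*.yaml | envsubst | kubectl apply -f -
```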

A number of environment variables must be set before running this command.

```shell
cat manifests/04.1-dpudeployment-installation-nvme/*.yaml | envsubst | kubectl apply -f -
```

This will deploy the following objects:

BFB to download BlueField Bitstream to a shared volume

```yaml
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: BFB
metadata:
  name: bf-bundle-$TAG
  namespace: dpf-operator-system
spec:
  url: $BFB_URL
```

HBN + SNAP NVMe DPUFlavor to configure DPUs on provisioning

```yaml
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUFlavor
metadata:
  name: hbn-snap-nvme-$TAG
  namespace: dpf-operator-system
spec:
  bfcfgParameters:
    - UPDATE_ATF_UEFI=yes
    - UPDATE_DPU_OS=yes
    - WITH_NIC_FW_UPDATE=yes
  configFiles:
    - operation: override
      path: /etc/mellanox/mlnx-bf.conf
      permissions: "0644"
      raw: |
        ALLOW_SHARED_RQ="no"
        IPSEC_FULL_OFFLOAD="no"
        ENABLE_ESWITCH_MULTIPORT="yes"
        RDMA_SET_NETNS_EXCLUSIVE="no"
    - operation: override
      path: /etc/mellanox/mlnx-ovs.conf
      permissions: "0644"
      raw: |
        CREATE_OVS_BRIDGES="no"
        OVS_DOCA="yes"
    - operation: override
      path: /etc/mellanox/mlnx-sf.conf
      permissions: "0644"
      raw: ""
  grub:
    kernelParameters:
      - console=hvc0
      - console=ttyAMA0
      - earlycon=pl011,0x13010000
      - fixrttc
      - net.ifnames=0
      - biosdevname=0
      - iommu.passthrough=1
      - cgroup_no_v1=net_prio,net_cls
      - hugepagesz=2048kB
      - hugepages=5120
  nvconfig:
    - device: '*'
      parameters:
        - PF_BAR2_ENABLE=0
        - PER_PF_NUM_SF=1
        - PF_TOTAL_SF=20
        - PF_SF_BAR_SIZE=10
        - NUM_PF_MSIX_VALID=0
        - PF_NUM_PF_MSIX_VALID=1
        - PF_NUM_PF_MSIX=228
        - INTERNAL_CPU_MODEL=1
        - INTERNAL_CPU_OFFLOAD_ENGINE=0
        - SRIOV_EN=1
        - NUM_OF_VFS=46
        - LAG_RESOURCE_ALLOCATION=1
        - PCI_SWITCH_EMULATION_ENABLE=1
        - PCI_SWITCH_EMULATION_NUM_PORT=32
        - NVME_EMULATION_ENABLE=1
        - NVME_EMULATION_NUM_PF=0
        - LINK_TYPE_P1=ETH
        - LINK_TYPE_P2=ETH
  ovs:
    rawConfigScript: |
      _ovs-vsctl() {
        ovs-vsctl --no-wait --timeout 15 "$@"
      }

      _ovs-vsctl set Open_vSwitch . other_config:doca-init=true
      _ovs-vsctl set Open_vSwitch . other_config:dpdk-max-memzones=50000
      _ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
      _ovs-vsctl set Open_vSwitch . other_config:pmd-quiet-idle=true
      _ovs-vsctl set Open_vSwitch . other_config:max-idle=20000
      _ovs-vsctl set Open_vSwitch . other_config:max-revalidator=5000
      _ovs-vsctl --if-exists del-br ovsbr1
      _ovs-vsctl --if-exists del-br ovsbr2
      _ovs-vsctl --may-exist add-br br-sfc
      _ovs-vsctl set bridge br-sfc datapath_type=netdev
      _ovs-vsctl set bridge br-sfc fail_mode=secure
      _ovs-vsctl --may-exist add-port br-sfc p0
      _ovs-vsctl set Interface p0 type=dpdk
      _ovs-vsctl set Interface p0 mtu_request=9216
      _ovs-vsctl set Port p0 external_ids:dpf-type=physical
      _ovs-vsctl --may-exist add-port br-sfc p1
      _ovs-vsctl set Interface p1 type=dpdk
      _ovs-vsctl set Interface p1 mtu_request=9216
      _ovs-vsctl set Port p1 external_ids:dpf-type=physical
      _ovs-vsctl --may-exist add-br br-hbn
      _ovs-vsctl set bridge br-hbn datapath_type=netdev
      _ovs-vsctl set bridge br-hbn fail_mode=secure
```

DPUServiceConfiguration and DPUServiceTemplate for DOCA HBN

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: doca-hbn
  namespace: dpf-operator-system
spec:
  deploymentServiceName: "doca-hbn"
  serviceConfiguration:
    serviceDaemonSet:
      annotations:
        k8s.v1.cni.cncf.io/networks: |-
          [
          {"name": "iprequest", "interface": "ip_lo", "cni-args": {"poolNames": ["loopback"], "poolType": "cidrpool"}},
          {"name": "iprequest", "interface": "ip_pf0vf10", "cni-args": {"poolNames": ["pool1"], "poolType": "cidrpool", "allocateDefaultGateway": true}},
          {"name": "iprequest", "interface": "ip_pf1vf10", "cni-args": {"poolNames": ["pool2"], "poolType": "cidrpool", "allocateDefaultGateway": true}}
          ]
    helmChart:
      values:
        configuration:
          perDPUValuesYAML: |
            - hostnamePattern: "*"
              values:
                bgp_peer_group: hbn
                vrf1: RED
                vrf2: BLUE
                l3vni1: 100001
                l3vni2: 100002
            - hostnamePattern: "worker1*"
              values:
                bgp_autonomous_system: 65101
            - hostnamePattern: "worker2*"
              values:
                bgp_autonomous_system: 65201
          startupYAMLJ2: |
            - header:
                model: BLUEFIELD
                nvue-api-version: nvue_v1
                rev-id: 1.0
                version: HBN 3.0.0
            - set:
                evpn:
                  enable: on
                  route-advertise: {}
                bridge:
                  domain:
                    br_default:
                      vlan:
                        '10':
                          vni:
                            '10': {}
                interface:
                  lo:
                    ip:
                      address:
                        {{ ipaddresses.ip_lo.ip }}/32: {}
                    type: loopback
                  p0_if,p1_if,pf0vf10_if,pf1vf10_if,snap_if:
                    type: swp
                    link:
                      mtu: 9000
                  pf0vf10_if:
                    ip:
                      address:
                        {{ ipaddresses.ip_pf0vf10.cidr }}: {}
                      vrf: {{ config.vrf1 }}
                  pf1vf10_if:
                    ip:
                      address:
                        {{ ipaddresses.ip_pf1vf10.cidr }}: {}
                      vrf: {{ config.vrf2 }}
                  snap_if:
                    bridge:
                      domain:
                        br_default:
                          access: 10
                  vlan10:
                    type: svi
                    vlan: 10
                nve:
                  vxlan:
                    arp-nd-suppress: on
                    enable: on
                    source:
                      address: {{ ipaddresses.ip_lo.ip }}
                router:
                  bgp:
                    enable: on
                    graceful-restart:
                      mode: full
                vrf:
                  default:
                    router:
                      bgp:
                        address-family:
                          ipv4-unicast:
                            enable: on
                            redistribute:
                              connected:
                                enable: on
                            multipaths:
                              ebgp: 16
                          l2vpn-evpn:
                            enable: on
                        autonomous-system: {{ config.bgp_autonomous_system }}
                        enable: on
                        neighbor:
                          p0_if:
                            peer-group: {{ config.bgp_peer_group }}
                            type: unnumbered
                            address-family:
                              l2vpn-evpn:
                                enable: on
                                add-path-tx: off
                          p1_if:
                            peer-group: {{ config.bgp_peer_group }}
                            type: unnumbered
                            address-family:
                              l2vpn-evpn:
                                enable: on
                                add-path-tx: off
                        path-selection:
                          multipath:
                            aspath-ignore: on
                        peer-group:
                          {{ config.bgp_peer_group }}:
                            address-family:
                              ipv4-unicast:
                                enable: on
                              l2vpn-evpn:
                                enable: on
                            remote-as: external
                        router-id: {{ ipaddresses.ip_lo.ip }}
                  {{ config.vrf1 }}:
                    evpn:
                      enable: on
                      vni:
                        {{ config.l3vni1 }}: {}
                    router:
                      bgp:
                        address-family:
                          ipv4-unicast:
                            enable: on
                            redistribute:
                              connected:
                                enable: on
                            route-export:
                              to-evpn:
                                enable: on
                        autonomous-system: {{ config.bgp_autonomous_system }}
                        enable: on
                  {{ config.vrf2 }}:
                    evpn:
                      enable: on
                      vni:
                        {{ config.l3vni2 }}: {}
                    router:
                      bgp:
                        address-family:
                          ipv4-unicast:
                            enable: on
                            redistribute:
                              connected:
                                enable: on
                            route-export:
                              to-evpn:
                                enable: on
                        autonomous-system: {{ config.bgp_autonomous_system }}
                        enable: on
  interfaces:
    - name: p0_if
      network: mybrhbn
    - name: p1_if
      network: mybrhbn
    - name: pf0vf10_if
      network: mybrhbn
    - name: pf1vf10_if
      network: mybrhbn
    - name: snap_if
      network: mybrhbn
```

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceTemplate
metadata:
  name: doca-hbn
  namespace: dpf-operator-system
spec:
  deploymentServiceName: "doca-hbn"
  helmChart:
    source:
      repoURL: $HELM_REGISTRY_REPO_URL
      version: 1.0.5
      chart: doca-hbn
    values:
      image:
        repository: $HBN_NGC_IMAGE_URL
        tag: 3.2.1-doca3.2.1
      resources:
        memory: 6Gi
        nvidia.com/bf_sf: 5
```

DPUServiceInterfaces for physical ports and VFs on the DPU

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: p0
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            uplink: "p0"
        spec:
          interfaceType: physical
          physical:
            interfaceName: p0
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: p1
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            uplink: "p1"
        spec:
          interfaceType: physical
          physical:
            interfaceName: p1
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: pf0vf10-rep
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            vf: "pf0vf10"
        spec:
          interfaceType: vf
          vf:
            parentInterfaceRef: p0
            pfID: 0
            vfID: 10
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: pf1vf10-rep
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            vf: "pf1vf10"
        spec:
          interfaceType: vf
          vf:
            parentInterfaceRef: p1
            pfID: 1
            vfID: 10
```

DPUServiceIPAMs

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceIPAM
metadata:
  name: pool1
  namespace: dpf-operator-system
spec:
  ipv4Network:
    network: "10.0.121.0/24"
    gatewayIndex: 2
    prefixSize: 29
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceIPAM
metadata:
  name: pool2
  namespace: dpf-operator-system
spec:
  ipv4Network:
    network: "10.0.122.0/24"
    gatewayIndex: 2
    prefixSize: 29
```

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceIPAM
metadata:
  name: loopback
  namespace: dpf-operator-system
spec:
  ipv4Network:
    network: "11.0.0.0/24"
    prefixSize: 32
```

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceIPAM
metadata:
  name: storage-pool
  namespace: dpf-operator-system
spec:
  metadata:
    labels:
      svc.dpu.nvidia.com/pool: storage-pool
  ipv4Subnet:
    subnet: "10.0.124.0/24"
    gateway: "10.0.124.1"
    perNodeIPCount: 8
```
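The `ipv4Network` pools above pair a network with a smaller `prefixSize`. Assuming `prefixSize` is the size of the block carved out of `network` for each allocation (an interpretation to confirm against the DPUServiceIPAM documentation), the pool capacity follows from simple prefix arithmetic. Both helper names below are hypothetical.

```shell
# Hypothetical helper: number of blocks of prefix length $2 that fit in
# a network of prefix length $1.
blocks_in_network() {
  echo $(( 1 << ($2 - $1) ))
}

# Hypothetical helper: number of IPv4 addresses in a block of prefix length $1.
addresses_per_block() {
  echo $(( 1 << (32 - $1) ))
}

# pool1 and pool2: a /24 network split into /29 blocks.
blocks_in_network 24 29
addresses_per_block 29

# loopback: a /24 network split into /32 blocks.
blocks_in_network 24 32
addresses_per_block 32
```

Under that assumption, `pool1` and `pool2` each yield 32 blocks of 8 addresses, and `loopback` yields 256 single-address blocks.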

DPUServiceConfiguration and DPUServiceTemplate for DOCA SNAP (NVMe)

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: doca-snap
  namespace: dpf-operator-system
spec:
  deploymentServiceName: doca-snap
  serviceConfiguration:
    helmChart:
      values:
        dpu:
          docaSnap:
            enabled: true
            image:
              repository: $SNAP_NGC_IMAGE_URL
              tag: 1.5.0-doca3.2.0
            snapRpcInitConf: |
              nvme_subsystem_create --nqn nqn.2022-10.io.nvda.nvme:0
  interfaces:
    - name: app_sf
      network: mybrsfc
```

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceTemplate
metadata:
  name: doca-snap
  namespace: dpf-operator-system
spec:
  deploymentServiceName: doca-snap
  helmChart:
    source:
      repoURL: $REGISTRY
      version: $TAG
      chart: dpf-storage
    values:
      serviceDaemonSet:
        resources:
          memory: "2Gi"
          hugepages-2Mi: "4Gi"
          cpu: "8"
          nvidia.com/bf_sf: 1
  resourceRequirements:
    memory: "2Gi"
    hugepages-2Mi: "4Gi"
    cpu: "8"
    nvidia.com/bf_sf: 1
```

DPUServiceConfiguration and DPUServiceTemplate for SNAP Host Controller

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: snap-host-controller
  namespace: dpf-operator-system
spec:
  deploymentServiceName: snap-host-controller
  upgradePolicy:
    applyNodeEffect: false
  serviceConfiguration:
    deployInCluster: true
    helmChart:
      values:
        host:
          snapHostController:
            enabled: true
            config:
              targetNamespace: dpf-operator-system
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: "node-role.kubernetes.io/master"
                          operator: Exists
                    - matchExpressions:
                        - key: "node-role.kubernetes.io/control-plane"
                          operator: Exists
```

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceTemplate
metadata:
  name: snap-host-controller
  namespace: dpf-operator-system
spec:
  deploymentServiceName: snap-host-controller
  helmChart:
    source:
      repoURL: $REGISTRY
      version: $TAG
      chart: dpf-storage
```

DPUServiceConfiguration and DPUServiceTemplate for SNAP Node Driver

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: snap-node-driver
  namespace: dpf-operator-system
spec:
  deploymentServiceName: snap-node-driver
  serviceConfiguration:
    helmChart:
      values:
        dpu:
          deployCrds: true
          snapNodeDriver:
            enabled: true
```

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceTemplate
metadata:
  name: snap-node-driver
  namespace: dpf-operator-system
spec:
  deploymentServiceName: snap-node-driver
  helmChart:
    source:
      repoURL: $REGISTRY
      version: $TAG
      chart: dpf-storage
```

DPUServiceConfiguration and DPUServiceTemplate for SNAP CSI Plugin (NVMe)

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: snap-csi-plugin
  namespace: dpf-operator-system
spec:
  deploymentServiceName: snap-csi-plugin
  upgradePolicy:
    applyNodeEffect: false
  serviceConfiguration:
    deployInCluster: true
    helmChart:
      values:
        host:
          snapCsiPlugin:
            enabled: true
            emulationMode: "nvme"
            controller:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                      - matchExpressions:
                          - key: "node-role.kubernetes.io/master"
                            operator: Exists
                      - matchExpressions:
                          - key: "node-role.kubernetes.io/control-plane"
                            operator: Exists
```

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceTemplate
metadata:
  name: snap-csi-plugin
  namespace: dpf-operator-system
spec:
  deploymentServiceName: snap-csi-plugin
  helmChart:
    source:
      repoURL: $REGISTRY
      version: $TAG
      chart: dpf-storage
```

DPUServiceConfiguration and DPUServiceTemplate for Block Storage DPU Plugin

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: block-storage-dpu-plugin
  namespace: dpf-operator-system
spec:
  deploymentServiceName: block-storage-dpu-plugin
  serviceConfiguration:
    helmChart:
      values:
        dpu:
          blockStorageVendorDpuPlugin:
            enabled: true
```

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceTemplate
metadata:
  name: block-storage-dpu-plugin
  namespace: dpf-operator-system
spec:
  deploymentServiceName: block-storage-dpu-plugin
  helmChart:
    source:
      repoURL: $REGISTRY
      version: $TAG
      chart: dpf-storage
```

DPUServiceConfiguration, DPUServiceTemplate and DPUServiceCredentialRequest for SPDK CSI Controller

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: spdk-csi-controller
  namespace: dpf-operator-system
spec:
  deploymentServiceName: spdk-csi-controller
  upgradePolicy:
    applyNodeEffect: false
  serviceConfiguration:
    deployInCluster: true
    helmChart:
      values:
        host:
          enabled: true
          config:
            targets:
              nodes:
                # name of the target
                - name: spdk-target
                  # management address
                  rpcURL: http:
                  # type of the target, e.g. nvme-tcp, nvme-rdma
                  targetType: nvme-rdma
                  # target service IP
                  targetAddr: 10.0.124.1
            # required parameter, name of the secret that contains connection
            # details to access the DPU cluster.
            # this secret should be created by the DPUServiceCredentialRequest API.
            dpuClusterSecret: spdk-csi-controller-dpu-cluster-credentials
```

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceTemplate
metadata:
  name: spdk-csi-controller
  namespace: dpf-operator-system
spec:
  deploymentServiceName: spdk-csi-controller
  helmChart:
    source:
      repoURL: oci:
      version: v0.3.0
      chart: spdk-csi-controller
```

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceCredentialRequest
metadata:
  name: spdk-csi-controller-credentials
  namespace: dpf-operator-system
spec:
  duration: 10m
  serviceAccount:
    name: spdk-csi-controller-sa
    namespace: dpf-operator-system
  targetCluster:
    name: dpu-cplane-tenant1
    namespace: dpu-cplane-tenant1
  type: tokenFile
  secret:
    name: spdk-csi-controller-dpu-cluster-credentials
    namespace: dpf-operator-system
```

Secret for SPDK CSI Controller

```yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: spdkcsi-secret
  namespace: dpf-operator-system
  labels:
    # this label enables replication of the secret from the host to the dpu cluster
    dpu.nvidia.com/image-pull-secret: ""
stringData:
  # name field in the "rpcTokens" list should match name of the
  # spdk target from DPUService.helmChart.values.host.config.targets.nodes
  secret.json: |-
    {
      "rpcTokens": [
        {
          "name": "spdk-target",
          "username": "exampleuser",
          "password": "examplepassword"
        }
      ]
    }
```

DPUServiceConfiguration and DPUServiceTemplate for SPDK CSI Controller on DPU

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: spdk-csi-controller-dpu
  namespace: dpf-operator-system
spec:
  deploymentServiceName: spdk-csi-controller-dpu
  upgradePolicy:
    applyNodeEffect: false
  serviceConfiguration:
    helmChart:
      values:
        dpu:
          enabled: true
          storageClass:
            # the name of the storage class that will be created for spdk-csi,
            # this StorageClass name should be used in the StorageVendor settings
            name: spdkcsi-sc
            # name of the secret that contains credentials for the remote SPDK target,
            # content of the secret is injected during CreateVolume request
            secretName: spdkcsi-secret
            # namespace of the secret with credentials for the remote SPDK target
            secretNamespace: dpf-operator-system
          rbacRoles:
            spdkCsiController:
              # the name of the service account for spdk-csi-controller
              # this value must be aligned with the value from the DPUServiceCredentialRequest
              serviceAccount: spdk-csi-controller-sa
```

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceTemplate
metadata:
  name: spdk-csi-controller-dpu
  namespace: dpf-operator-system
spec:
  deploymentServiceName: spdk-csi-controller-dpu
  helmChart:
    source:
      repoURL: oci:
      version: v0.3.0
      chart: spdk-csi-controller
```

DPUDeployment to provision DPUs on worker nodes

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUDeployment
metadata:
  name: hbn-snap
  namespace: dpf-operator-system
spec:
  dpus:
    bfb: bf-bundle-$TAG
    flavor: hbn-snap-nvme-$TAG
    dpuSets:
      - nameSuffix: "dpuset1"
        dpuAnnotations:
          storage.nvidia.com/preferred-dpu: "true"
        nodeSelector:
          matchLabels:
            feature.node.kubernetes.io/dpu-enabled: "true"
  services:
    doca-hbn:
      serviceTemplate: doca-hbn
      serviceConfiguration: doca-hbn
    snap-csi-plugin:
      serviceTemplate: snap-csi-plugin
      serviceConfiguration: snap-csi-plugin
    snap-host-controller:
      serviceTemplate: snap-host-controller
      serviceConfiguration: snap-host-controller
    snap-node-driver:
      serviceTemplate: snap-node-driver
      serviceConfiguration: snap-node-driver
    doca-snap:
      serviceTemplate: doca-snap
      serviceConfiguration: doca-snap
    block-storage-dpu-plugin:
      serviceTemplate: block-storage-dpu-plugin
      serviceConfiguration: block-storage-dpu-plugin
    spdk-csi-controller:
      serviceTemplate: spdk-csi-controller
      serviceConfiguration: spdk-csi-controller
    spdk-csi-controller-dpu:
      serviceTemplate: spdk-csi-controller-dpu
      serviceConfiguration: spdk-csi-controller-dpu
  serviceChains:
    switches:
      - ports:
          - serviceInterface:
              matchLabels:
                uplink: p0
          - service:
              name: doca-hbn
              interface: p0_if
      - ports:
          - serviceInterface:
              matchLabels:
                uplink: p1
          - service:
              name: doca-hbn
              interface: p1_if
      - ports:
          - serviceInterface:
              matchLabels:
                vf: pf0vf10
          - service:
              name: doca-hbn
              interface: pf0vf10_if
      - ports:
          - serviceInterface:
              matchLabels:
                vf: pf1vf10
          - service:
              name: doca-hbn
              interface: pf1vf10_if
      - ports:
          - service:
              name: doca-snap
              interface: app_sf
              ipam:
                matchLabels:
                  svc.dpu.nvidia.com/pool: storage-pool
          - service:
              name: doca-hbn
              interface: snap_if
```





These verification commands may need to be run multiple times to ensure the condition is met.

```shell
## Ensure the DPUServices are created and have been reconciled.
kubectl wait --for=condition=ApplicationsReconciled --namespace dpf-operator-system dpuservices -l svc.dpu.nvidia.com/owned-by-dpudeployment=dpf-operator-system_hbn-snap
## Ensure the DPUServiceIPAMs have been reconciled
kubectl wait --for=condition=DPUIPAMObjectReconciled --namespace dpf-operator-system dpuserviceipam --all
## Ensure the DPUServiceInterfaces have been reconciled
kubectl wait --for=condition=ServiceInterfaceSetReconciled --namespace dpf-operator-system dpuserviceinterface --all
## Ensure the DPUServiceChains have been reconciled
kubectl wait --for=condition=ServiceChainSetReconciled --namespace dpf-operator-system dpuservicechain --all
```

A number of environment variables must be set before running this command.

```shell
cat manifests/04.2-dpudeployment-installation-virtiofs/*.yaml | envsubst | kubectl apply -f -
```

This will deploy the following objects:

BFB to download BlueField Bitstream to a shared volume

```yaml
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: BFB
metadata:
  name: bf-bundle-$TAG
  namespace: dpf-operator-system
spec:
  url: $BFB_URL
```

SNAP VirtioFS with HBN DPUFlavor to configure DPUs on provisioning

```yaml
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUFlavor
metadata:
  name: hbn-snap-virtiofs-$TAG
  namespace: dpf-operator-system
spec:
  bfcfgParameters:
    - UPDATE_ATF_UEFI=yes
    - UPDATE_DPU_OS=yes
    - WITH_NIC_FW_UPDATE=yes
  configFiles:
    - operation: override
      path: /etc/mellanox/mlnx-bf.conf
      permissions: "0644"
      raw: |
        ALLOW_SHARED_RQ="no"
        IPSEC_FULL_OFFLOAD="no"
        ENABLE_ESWITCH_MULTIPORT="yes"
        RDMA_SET_NETNS_EXCLUSIVE="no"
    - operation: override
      path: /etc/mellanox/mlnx-ovs.conf
      permissions: "0644"
      raw: |
        CREATE_OVS_BRIDGES="no"
        OVS_DOCA="yes"
    - operation: override
      path: /etc/mellanox/mlnx-sf.conf
      permissions: "0644"
      raw: ""
  grub:
    kernelParameters:
      - console=hvc0
      - console=ttyAMA0
      - earlycon=pl011,0x13010000
      - fixrttc
      - net.ifnames=0
      - biosdevname=0
      - iommu.passthrough=1
      - cgroup_no_v1=net_prio,net_cls
      - hugepagesz=2048kB
      - hugepages=5120
  nvconfig:
    - device: '*'
      parameters:
        - PF_BAR2_ENABLE=0
        - PER_PF_NUM_SF=1
        - PF_TOTAL_SF=20
        - PF_SF_BAR_SIZE=10
        - NUM_PF_MSIX_VALID=0
        - PF_NUM_PF_MSIX_VALID=1
        - PF_NUM_PF_MSIX=228
        - INTERNAL_CPU_MODEL=1
        - INTERNAL_CPU_OFFLOAD_ENGINE=0
        - SRIOV_EN=1
        - NUM_OF_VFS=46
        - LAG_RESOURCE_ALLOCATION=1
        - PCI_SWITCH_EMULATION_ENABLE=1
        - PCI_SWITCH_EMULATION_NUM_PORT=32
        - VIRTIO_FS_EMULATION_ENABLE=1
        - VIRTIO_FS_EMULATION_NUM_PF=0
        - LINK_TYPE_P1=ETH
        - LINK_TYPE_P2=ETH
  ovs:
    rawConfigScript: |
      _ovs-vsctl() {
        ovs-vsctl --no-wait --timeout 15 "$@"
      }

      _ovs-vsctl set Open_vSwitch . other_config:doca-init=true
      _ovs-vsctl set Open_vSwitch . other_config:dpdk-max-memzones=50000
      _ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
      _ovs-vsctl set Open_vSwitch . other_config:pmd-quiet-idle=true
      _ovs-vsctl set Open_vSwitch . other_config:max-idle=20000
      _ovs-vsctl set Open_vSwitch . other_config:max-revalidator=5000
      _ovs-vsctl --if-exists del-br ovsbr1
      _ovs-vsctl --if-exists del-br ovsbr2
      _ovs-vsctl --may-exist add-br br-sfc
      _ovs-vsctl set bridge br-sfc datapath_type=netdev
      _ovs-vsctl set bridge br-sfc fail_mode=secure
      _ovs-vsctl --may-exist add-port br-sfc p0
      _ovs-vsctl set Interface p0 type=dpdk
      _ovs-vsctl set Interface p0 mtu_request=9216
      _ovs-vsctl set Port p0 external_ids:dpf-type=physical
      _ovs-vsctl --may-exist add-port br-sfc p1
      _ovs-vsctl set Interface p1 type=dpdk
      _ovs-vsctl set Interface p1 mtu_request=9216
      _ovs-vsctl set Port p1 external_ids:dpf-type=physical
      _ovs-vsctl --may-exist add-br br-hbn
      _ovs-vsctl set bridge br-hbn datapath_type=netdev
      _ovs-vsctl set bridge br-hbn fail_mode=secure
```

DPUServiceConfiguration and DPUServiceTemplate for DOCA HBN

```yaml
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: doca-hbn
  namespace: dpf-operator-system
spec:
  deploymentServiceName: "doca-hbn"
  serviceConfiguration:
    serviceDaemonSet:
      annotations:
        k8s.v1.cni.cncf.io/networks: |-
          [
          {"name": "iprequest", "interface": "ip_lo", "cni-args": {"poolNames": ["loopback"], "poolType": "cidrpool"}},
          {"name": "iprequest", "interface": "ip_pf0vf10", "cni-args": {"poolNames": ["pool1"], "poolType": "cidrpool", "allocateDefaultGateway": true}},
          {"name": "iprequest", "interface": "ip_pf1vf10", "cni-args": {"poolNames": ["pool2"], "poolType": "cidrpool", "allocateDefaultGateway": true}}
          ]
    helmChart:
      values:
        configuration:
          perDPUValuesYAML: |
            - hostnamePattern: "*"
              values:
                bgp_peer_group: hbn
                vrf1: RED
                vrf2: BLUE
                l3vni1: 100001
                l3vni2: 100002
            - hostnamePattern: "worker1*"
              values:
                bgp_autonomous_system: 65101
            - hostnamePattern: "worker2*"
              values:
                bgp_autonomous_system: 65201
          startupYAMLJ2: |
            - header:
                model: BLUEFIELD
                nvue-api-version: nvue_v1
                rev-id: 1.0
                version: HBN 3.0.0
            - set:
                evpn:
                  enable: on
                  route-advertise: {}
                bridge:
                  domain:
                    br_default:
                      vlan:
                        '10':
                          vni:
                            '10': {}
                interface:
                  lo:
                    ip:
                      address:
                        {{ ipaddresses.ip_lo.ip }}/32: {}
                    type: loopback
                  p0_if,p1_if,pf0vf10_if,pf1vf10_if,snap_if:
                    type: swp
                    link:
                      mtu: 9000
                  pf0vf10_if:
                    ip:
                      address:
                        {{ ipaddresses.ip_pf0vf10.cidr }}: {}
                      vrf: {{ config.vrf1 }}
                  pf1vf10_if:
                    ip:
                      address:
                        {{ ipaddresses.ip_pf1vf10.cidr }}: {}
                      vrf: {{ config.vrf2 }}
                  snap_if:
                    bridge:
                      domain:
                        br_default:
                          access: 10
                  vlan10:
                    type: svi
                    vlan: 10
                nve:
                  vxlan:
                    arp-nd-suppress: on
                    enable: on
                    source:
                      address: {{ ipaddresses.ip_lo.ip }}
                router:
                  bgp:
                    enable: on
                    graceful-restart:
                      mode: full
                vrf:
                  default:
                    router:
                      bgp:
                        address-family:
                          ipv4-unicast:
                            enable: on
                            redistribute:
                              connected:
                                enable: on
                            multipaths:
                              ebgp: 16
                          l2vpn-evpn:
                            enable: on
                        autonomous-system: {{ config.bgp_autonomous_system }}
                        enable: on
                        neighbor:
                          p0_if:
                            peer-group: {{ config.bgp_peer_group }}
                            type: unnumbered
                            address-family:
                              l2vpn-evpn:
                                enable: on
                                add-path-tx: off
                          p1_if:
                            peer-group: {{ config.bgp_peer_group }}
                            type: unnumbered
                            address-family:
                              l2vpn-evpn:
                                enable: on
                                add-path-tx: off
                        path-selection:
                          multipath:
                            aspath-ignore: on
                        peer-group:
                          {{ config.bgp_peer_group }}:
                            address-family:
                              ipv4-unicast:
                                enable: on
                              l2vpn-evpn:
                                enable: on
                            remote-as: external
                        router-id: {{ ipaddresses.ip_lo.ip }}
                  {{ config.vrf1 }}:
                    evpn:
                      enable: on
                      vni:
                        {{ config.l3vni1 }}: {}
                    router:
                      bgp:
                        address-family:
                          ipv4-unicast:
                            enable: on
                            redistribute:
                              connected:
                                enable: on
                            route-export:
                              to-evpn:
                                enable: on
                        autonomous-system: {{ config.bgp_autonomous_system }}
                        enable: on
                  {{ config.vrf2 }}:
                    evpn:
                      enable: on
                      vni:
                        {{ config.l3vni2 }}: {}
                    router:
                      bgp:
                        address-family:
                          ipv4-unicast:
                            enable: on
                            redistribute:
                              connected:
                                enable: on
                            route-export:
                              to-evpn:
                                enable: on
                        autonomous-system: {{ config.bgp_autonomous_system }}
                        enable: on
  interfaces:
    - name: p0_if
      network: mybrhbn
    - name: p1_if
      network: mybrhbn
    - name: pf0vf10_if
      network: mybrhbn
    - name: pf1vf10_if
      network: mybrhbn
    - name: snap_if
      network: mybrhbn
```

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceTemplate
    metadata:
      name: doca-hbn
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: "doca-hbn"
      helmChart:
        source:
          repoURL: $HELM_REGISTRY_REPO_URL
          version: 1.0.5
          chart: doca-hbn
        values:
          image:
            repository: $HBN_NGC_IMAGE_URL
            tag: 3.2.1-doca3.2.1
          resources:
            memory: 6Gi
            nvidia.com/bf_sf: 5

DPUServiceConfiguration and DPUServiceTemplate for DOCA SNAP

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceConfiguration
    metadata:
      name: doca-snap
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: doca-snap
      serviceConfiguration:
        helmChart:
          values:
            dpu:
              docaSnap:
                enabled: true
                env:
                  XLIO_ENABLED: "0"
                image:
                  repository: $SNAP_NGC_IMAGE_URL
                  tag: 1.5.0-doca3.2.0
      interfaces:
      - name: app_sf
        network: mybrsfc

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceTemplate
    metadata:
      name: doca-snap
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: doca-snap
      helmChart:
        source:
          repoURL: $REGISTRY
          version: $TAG
          chart: dpf-storage
        values:
          serviceDaemonSet:
            resources:
              memory: "2Gi"
              hugepages-2Mi: "4Gi"
              cpu: "8"
              nvidia.com/bf_sf: 1
      resourceRequirements:
        memory: "2Gi"
        hugepages-2Mi: "4Gi"
        cpu: "8"
        nvidia.com/bf_sf: 1

DPUServiceConfiguration and DPUServiceTemplate for SNAP Host Controller

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceConfiguration
    metadata:
      name: snap-host-controller
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: snap-host-controller
      upgradePolicy:
        applyNodeEffect: false
      serviceConfiguration:
        deployInCluster: true
        helmChart:
          values:
            host:
              snapHostController:
                enabled: true
                config:
                  targetNamespace: dpf-operator-system
                affinity:
                  nodeAffinity:
                    requiredDuringSchedulingIgnoredDuringExecution:
                      nodeSelectorTerms:
                      - matchExpressions:
                        - key: "node-role.kubernetes.io/master"
                          operator: Exists
                      - matchExpressions:
                        - key: "node-role.kubernetes.io/control-plane"
                          operator: Exists

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceTemplate
    metadata:
      name: snap-host-controller
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: snap-host-controller
      helmChart:
        source:
          repoURL: $REGISTRY
          version: $TAG
          chart: dpf-storage

DPUServiceConfiguration and DPUServiceTemplate for SNAP Node Driver

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceConfiguration
    metadata:
      name: snap-node-driver
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: snap-node-driver
      serviceConfiguration:
        helmChart:
          values:
            dpu:
              deployCrds: true
              snapNodeDriver:
                enabled: true

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceTemplate
    metadata:
      name: snap-node-driver
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: snap-node-driver
      helmChart:
        source:
          repoURL: $REGISTRY
          version: $TAG
          chart: dpf-storage

DPUServiceConfiguration and DPUServiceTemplate for SNAP CSI Plugin

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceConfiguration
    metadata:
      name: snap-csi-plugin
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: snap-csi-plugin
      upgradePolicy:
        applyNodeEffect: false
      serviceConfiguration:
        deployInCluster: true
        helmChart:
          values:
            host:
              snapCsiPlugin:
                enabled: true
                emulationMode: "virtiofs"
                controller:
                  affinity:
                    nodeAffinity:
                      requiredDuringSchedulingIgnoredDuringExecution:
                        nodeSelectorTerms:
                        - matchExpressions:
                          - key: "node-role.kubernetes.io/master"
                            operator: Exists
                        - matchExpressions:
                          - key: "node-role.kubernetes.io/control-plane"
                            operator: Exists

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceTemplate
    metadata:
      name: snap-csi-plugin
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: snap-csi-plugin
      helmChart:
        source:
          repoURL: $REGISTRY
          version: $TAG
          chart: dpf-storage

DPUServiceConfiguration and DPUServiceTemplate for FS Storage DPU Plugin

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceConfiguration
    metadata:
      name: fs-storage-dpu-plugin
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: fs-storage-dpu-plugin
      serviceConfiguration:
        helmChart:
          values:
            dpu:
              fsStorageVendorDpuPlugin:
                enabled: true
      interfaces:
      - name: app_sf
        network: mybrsfc

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceTemplate
    metadata:
      name: fs-storage-dpu-plugin
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: fs-storage-dpu-plugin
      helmChart:
        source:
          repoURL: $REGISTRY
          version: $TAG
          chart: dpf-storage
        values:
          serviceDaemonSet:
            resources:
              nvidia.com/bf_sf: 1
      resourceRequirements:
        nvidia.com/bf_sf: 1

DPUServiceConfiguration, DPUServiceTemplate and DPUServiceCredentialRequest for NFS CSI Controller (host)

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceConfiguration
    metadata:
      name: nfs-csi-controller
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: nfs-csi-controller
      upgradePolicy:
        applyNodeEffect: false
      serviceConfiguration:
        deployInCluster: true
        helmChart:
          values:
            host:
              enabled: true
              config:
                # required parameter, name of the secret that contains connection
                # details to access the DPU cluster.
                # this secret should be created by the DPUServiceCredentialRequest API.
                dpuClusterSecret: nfs-csi-controller-dpu-cluster-credentials

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceTemplate
    metadata:
      name: nfs-csi-controller
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: nfs-csi-controller
      helmChart:
        source:
          repoURL: oci:
          version: v0.2.0
          chart: nfs-csi-controller

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceCredentialRequest
    metadata:
      name: nfs-csi-controller-credentials
      namespace: dpf-operator-system
    spec:
      duration: 24h
      serviceAccount:
        name: nfs-csi-controller-sa
        namespace: dpf-operator-system
      targetCluster:
        name: dpu-cplane-tenant1
        namespace: dpu-cplane-tenant1
      type: tokenFile
      secret:
        name: nfs-csi-controller-dpu-cluster-credentials
        namespace: dpf-operator-system

DPUServiceConfiguration and DPUServiceTemplate for NFS CSI Controller (DPU)

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceConfiguration
    metadata:
      name: nfs-csi-controller-dpu
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: nfs-csi-controller-dpu
      upgradePolicy:
        applyNodeEffect: false
      serviceConfiguration:
        helmChart:
          values:
            dpu:
              enabled: true
              storageClasses:
                # List of storage classes to be created for nfs-csi
                # These StorageClass names should be used in the StorageVendor settings
                - name: nfs-csi
                  parameters:
                    server: 10.0.124.1
                    share: /srv/nfs/share
              rbacRoles:
                nfsCsiController:
                  # the name of the service account for nfs-csi-controller
                  # this value must be aligned with the value from the DPUServiceCredentialRequest
                  serviceAccount: nfs-csi-controller-sa

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceTemplate
    metadata:
      name: nfs-csi-controller-dpu
      namespace: dpf-operator-system
    spec:
      deploymentServiceName: nfs-csi-controller-dpu
      helmChart:
        source:
          repoURL: oci:
          version: v0.2.0
          chart: nfs-csi-controller

DPUServiceIPAMs

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceIPAM
    metadata:
      name: pool1
      namespace: dpf-operator-system
    spec:
      ipv4Network:
        network: "10.0.121.0/24"
        gatewayIndex: 2
        prefixSize: 29
    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceIPAM
    metadata:
      name: pool2
      namespace: dpf-operator-system
    spec:
      ipv4Network:
        network: "10.0.122.0/24"
        gatewayIndex: 2
        prefixSize: 29

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceIPAM
    metadata:
      name: loopback
      namespace: dpf-operator-system
    spec:
      ipv4Network:
        network: "11.0.0.0/24"
        prefixSize: 32

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceIPAM
    metadata:
      name: storage-pool
      namespace: dpf-operator-system
    spec:
      metadata:
        labels:
          svc.dpu.nvidia.com/pool: storage-pool
      ipv4Subnet:
        subnet: "10.0.124.0/24"
        gateway: "10.0.124.1"
        perNodeIPCount: 8

DPUServiceInterfaces for physical ports and VFs on the DPU

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceInterface
    metadata:
      name: p0
      namespace: dpf-operator-system
    spec:
      template:
        spec:
          template:
            metadata:
              labels:
                uplink: "p0"
            spec:
              interfaceType: physical
              physical:
                interfaceName: p0
    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceInterface
    metadata:
      name: p1
      namespace: dpf-operator-system
    spec:
      template:
        spec:
          template:
            metadata:
              labels:
                uplink: "p1"
            spec:
              interfaceType: physical
              physical:
                interfaceName: p1
    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceInterface
    metadata:
      name: pf0vf10-rep
      namespace: dpf-operator-system
    spec:
      template:
        spec:
          template:
            metadata:
              labels:
                vf: "pf0vf10"
            spec:
              interfaceType: vf
              vf:
                parentInterfaceRef: p0
                pfID: 0
                vfID: 10
    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUServiceInterface
    metadata:
      name: pf1vf10-rep
      namespace: dpf-operator-system
    spec:
      template:
        spec:
          template:
            metadata:
              labels:
                vf: "pf1vf10"
            spec:
              interfaceType: vf
              vf:
                parentInterfaceRef: p1
                pfID: 1
                vfID: 10

DPUDeployment to provision DPUs on worker nodes

    ---
    apiVersion: svc.dpu.nvidia.com/v1alpha1
    kind: DPUDeployment
    metadata:
      name: hbn-snap
      namespace: dpf-operator-system
    spec:
      dpus:
        bfb: bf-bundle-$TAG
        flavor: hbn-snap-virtiofs-$TAG
        dpuSets:
        - nameSuffix: "dpuset1"
          dpuAnnotations:
            storage.nvidia.com/preferred-dpu: "true"
          nodeSelector:
            matchLabels:
              feature.node.kubernetes.io/dpu-enabled: "true"
      services:
        doca-hbn:
          serviceTemplate: doca-hbn
          serviceConfiguration: doca-hbn
        snap-csi-plugin:
          serviceTemplate: snap-csi-plugin
          serviceConfiguration: snap-csi-plugin
        snap-host-controller:
          serviceTemplate: snap-host-controller
          serviceConfiguration: snap-host-controller
        snap-node-driver:
          serviceTemplate: snap-node-driver
          serviceConfiguration: snap-node-driver
        doca-snap:
          serviceTemplate: doca-snap
          serviceConfiguration: doca-snap
        fs-storage-dpu-plugin:
          serviceTemplate: fs-storage-dpu-plugin
          serviceConfiguration: fs-storage-dpu-plugin
        nfs-csi-controller:
          serviceTemplate: nfs-csi-controller
          serviceConfiguration: nfs-csi-controller
        nfs-csi-controller-dpu:
          serviceTemplate: nfs-csi-controller-dpu
          serviceConfiguration: nfs-csi-controller-dpu
      serviceChains:
        switches:
        - ports:
          - serviceInterface:
              matchLabels:
                uplink: p0
          - service:
              name: doca-hbn
              interface: p0_if
        - ports:
          - serviceInterface:
              matchLabels:
                uplink: p1
          - service:
              name: doca-hbn
              interface: p1_if
        - ports:
          - serviceInterface:
              matchLabels:
                vf: pf0vf10
          - service:
              name: doca-hbn
              interface: pf0vf10_if
        - ports:
          - serviceInterface:
              matchLabels:
                vf: pf1vf10
          - service:
              name: doca-hbn
              interface: pf1vf10_if
        - ports:
          - service:
              name: doca-snap
              interface: app_sf
              ipam:
                matchLabels:
                  svc.dpu.nvidia.com/pool: storage-pool
          - service:
              name: fs-storage-dpu-plugin
              interface: app_sf
              ipam:
                matchLabels:
                  svc.dpu.nvidia.com/pool: storage-pool
          - service:
              name: doca-hbn
              interface: snap_if





These verification commands may need to be run more than once before the conditions are met, because the objects take time to reconcile.

Note that each DPUService name carries a random suffix, for example doca-hbn-l2xsl.

Verify the DPU and Service installation with:

    ## Ensure the DPUServices are created and have been reconciled.
    kubectl wait --for=condition=ApplicationsReconciled --namespace dpf-operator-system \
      dpuservices -l svc.dpu.nvidia.com/owned-by-dpudeployment=dpf-operator-system_hbn-snap
    ## Ensure the DPUServiceIPAMs have been reconciled
    kubectl wait --for=condition=DPUIPAMObjectReconciled --namespace dpf-operator-system \
      dpuserviceipam --all
    ## Ensure the DPUServiceInterfaces have been reconciled
    kubectl wait --for=condition=ServiceInterfaceSetReconciled --namespace dpf-operator-system \
      dpuserviceinterface --all
    ## Ensure the DPUServiceChains have been reconciled
    kubectl wait --for=condition=ServiceChainSetReconciled --namespace dpf-operator-system \
      dpuservicechain --all
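Because `kubectl wait` exits with a non-zero status when its timeout expires before the condition is met, the verification commands can be wrapped in a small retry loop instead of being re-run by hand. The following is a minimal sketch. The `retry` helper and its attempt and delay values are illustrative and not part of DPF.

```shell
# Re-run a command until it succeeds or the attempt budget is exhausted.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    max=$1; delay=$2; shift 2
    attempt=1
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$max" ] && return 1
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (not executed here):
#   retry 10 30 kubectl wait --for=condition=DPUIPAMObjectReconciled \
#       --namespace dpf-operator-system dpuserviceipam --all
```

The helper returns 0 as soon as the wrapped command succeeds and 1 once all attempts have failed, so it can be chained with `&&` between verification steps.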

At this point workers should be added to the cluster. Each worker node should be configured in line with the prerequisites. As workers are added to the cluster DPUs will be provisioned and DPUServices will begin to be spun up.

You can verify the status of the DPUDeployment and its components with the following command:

    kubectl -n dpf-operator-system exec deploy/dpf-operator-controller-manager -- /dpfctl describe dpudeployments

This document contains network configuration that uses the P0 and P1 network PFs and VF 10 from both PFs. These interfaces are referenced in SriovNetworkNodePolicy and NetworkAttachmentDefinition objects, which are created in the next step. The PCI addresses and names of the network PFs and VFs on the host side are likely to change after DPU provisioning, because the DPUFlavor includes the PCI_SWITCH_EMULATION_ENABLE firmware setting.
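One way to look up a VF's name on a worker host is through sysfs, where each PF's PCI device directory exposes a `virtfn<N>` link per VF. The sketch below assumes the VFs have already been created and are bound to a network driver. The function name `vf_netdev_name` and the `SYSFS_ROOT` override are illustrative, not part of DPF.

```shell
# Print the netdev name of VF <vf_id> under PF <pf_name>.
# SYSFS_ROOT defaults to /sys; it is overridable only so the function
# can be tried against a fake directory tree.
vf_netdev_name() {
    pf_name=$1; vf_id=$2
    ls "${SYSFS_ROOT:-/sys}/class/net/${pf_name}/device/virtfn${vf_id}/net" 2>/dev/null
}

# Example, run on a worker host (the PF name is the example used in this guide):
#   vf_netdev_name enp8s0f0np0 10
```

The PF names themselves can be listed with `ip -br link` and matched against the BlueField device on the host.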

Before applying the network configuration, you need to identify the new names of the network PFs and VFs on the host side and set the following environment variables:

    # contains the name of the network PF 0 on the host side, e.g. enp8s0f0np0
    export DPU_P0_PF_NAME=<REPLACE_WITH_INTERFACE_NAME>
    # contains the name of the network PF 1 on the host side, e.g. enp8s0f1np1
    export DPU_P1_PF_NAME=<REPLACE_WITH_INTERFACE_NAME>
    # contains the name of the network VF 10 on P0 on the host side, e.g. enp8s0f0v10
    export DPU_P0_VF10_NAME=<REPLACE_WITH_INTERFACE_NAME>
    # contains the name of the network VF 10 on P1 on the host side, e.g. enp8s0f1v10
    export DPU_P1_VF10_NAME=<REPLACE_WITH_INTERFACE_NAME>

    cat manifests/05-network-configuration/*.yaml | envsubst | kubectl apply -f -
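`envsubst` substitutes an empty string for any variable that is not set in the environment, so a forgotten export would silently produce manifests with empty interface names. A minimal guard, assuming a POSIX shell, might look like this. The `require_vars` helper is hypothetical and not part of DPF.

```shell
# Return non-zero and name the first variable that is unset or empty.
require_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing value for $name" >&2
            return 1
        fi
    done
}

# Example (run before piping the manifests through envsubst):
#   require_vars DPU_P0_PF_NAME DPU_P1_PF_NAME DPU_P0_VF10_NAME DPU_P1_VF10_NAME
```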

This will create the following objects:

SriovNetworkNodePolicy to configure VFs on P0 and P1

    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: bf3-p0-vfs
      namespace: nvidia-network-operator
    spec:
      nicSelector:
        deviceID: "a2dc"
        vendor: "15b3"
        pfNames:
        - $DPU_P0_PF_NAME#2-45
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      numVfs: 46
      resourceName: bf3-p0-vfs
      isRdma: true
      externallyManaged: true
      deviceType: netdevice
      linkType: eth
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: bf3-p1-vfs
      namespace: nvidia-network-operator
    spec:
      nicSelector:
        deviceID: "a2dc"
        vendor: "15b3"
        pfNames:
        - $DPU_P1_PF_NAME#2-45
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      numVfs: 46
      resourceName: bf3-p1-vfs
      isRdma: true
      externallyManaged: true
      deviceType: netdevice
      linkType: eth

NetworkAttachmentDefinition for host-device VFs

    apiVersion: "k8s.cni.cncf.io/v1"
    kind: NetworkAttachmentDefinition
    metadata:
      name: hostdev-pf0vf10-worker1
    spec:
      config: '{
        "cniVersion": "0.3.1",
        "name": "hostpf0vf10",
        "type": "host-device",
        "device": "$DPU_P0_VF10_NAME",
        "ipam": {
          "type": "static",
          "addresses": [ { "address": "10.0.121.1/29" } ],
          "routes": [ { "dst": "10.0.121.8/29", "gw": "10.0.121.2" } ]
        }
      }'
    ---
    apiVersion: "k8s.cni.cncf.io/v1"
    kind: NetworkAttachmentDefinition
    metadata:
      name: hostdev-pf1vf10-worker1
    spec:
      config: '{
        "cniVersion": "0.3.1",
        "name": "hostpf1vf10",
        "type": "host-device",
        "device": "$DPU_P1_VF10_NAME",
        "ipam": {
          "type": "static",
          "addresses": [ { "address": "10.0.122.1/29" } ],
          "routes": [ { "dst": "10.0.122.8/29", "gw": "10.0.122.2" } ]
        }
      }'
    ---
    apiVersion: "k8s.cni.cncf.io/v1"
    kind: NetworkAttachmentDefinition
    metadata:
      name: hostdev-pf0vf10-worker2
    spec:
      config: '{
        "cniVersion": "0.3.1",
        "name": "hostpf0vf10",
        "type": "host-device",
        "device": "$DPU_P0_VF10_NAME",
        "ipam": {
          "type": "static",
          "addresses": [ { "address": "10.0.121.9/29" } ],
          "routes": [ { "dst": "10.0.121.0/29", "gw": "10.0.121.10" } ]
        }
      }'
    ---
    apiVersion: "k8s.cni.cncf.io/v1"
    kind: NetworkAttachmentDefinition
    metadata:
      name: hostdev-pf1vf10-worker2
    spec:
      config: '{
        "cniVersion": "0.3.1",
        "name": "hostpf1vf10",
        "type": "host-device",
        "device": "$DPU_P1_VF10_NAME",
        "ipam": {
          "type": "static",
          "addresses": [ { "address": "10.0.122.9/29" } ],
          "routes": [ { "dst": "10.0.122.0/29", "gw": "10.0.122.10" } ]
        }
      }'
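The static addresses in these NetworkAttachmentDefinitions line up with the /29 blocks that pool1 and pool2 carve out of their /24 networks (`prefixSize: 29`, `gatewayIndex: 2`). The arithmetic can be sketched as follows. The `block_plan` function is illustrative, and the mapping of worker1 to block 0 and worker2 to block 1 is an assumption read off the manifests above, not something DPF guarantees.

```shell
# For a /24 network split into /29 blocks, print the block's subnet,
# the host-side VF address and the gateway for block number <n> (0-based).
# Arguments: first three octets of the network, block index.
block_plan() {
    prefix=$1; n=$2
    size=$(( 1 << (32 - 29) ))   # 8 addresses per /29 block
    base=$(( n * size ))
    echo "subnet=${prefix}.${base}/29 host=${prefix}.$(( base + 1 ))/29 gw=${prefix}.$(( base + 2 ))"
}

block_plan 10.0.121 0   # subnet=10.0.121.0/29 host=10.0.121.1/29 gw=10.0.121.2
block_plan 10.0.121 1   # subnet=10.0.121.8/29 host=10.0.121.9/29 gw=10.0.121.10
```

Each worker's route then points at the other worker's block through its own gateway, which is why worker1 routes 10.0.121.8/29 via 10.0.121.2 and worker2 routes 10.0.121.0/29 via 10.0.121.10.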





    kubectl apply -f manifests/06-network-test

HBN functionality can be tested by pinging between the pods and services deployed in the default namespace.
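A ping between test pods could be scripted along these lines. This is a sketch only: the pod name and target address are placeholders to be filled in from `kubectl get pods --namespace default -o wide`, and it assumes the test image ships a `ping` binary.

```shell
# Ping <target_ip> from inside <pod> in the default namespace.
# Both arguments are placeholders: use the pod names and IPs created by
# manifests/06-network-test in your cluster.
ping_from_pod() {
    pod=$1; target_ip=$2
    kubectl exec --namespace default "$pod" -- ping -c 3 "$target_ip"
}

# Example (names and address are hypothetical):
#   ping_from_pod <source-pod> <destination-pod-ip>
```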

    cat manifests/07.1-storage-configuration-nvme/*.yaml | envsubst | kubectl apply -f -

This will create the following objects:

DPUStorageVendor for SPDK CSI

    ---
    apiVersion: storage.dpu.nvidia.com/v1alpha1
    kind: DPUStorageVendor
    metadata:
      name: spdk-csi
      namespace: dpf-operator-system
    spec:
      storageClassName: spdkcsi-sc
      pluginName: nvidia-block

DPUStoragePolicy for block storage

    ---
    apiVersion: storage.dpu.nvidia.com/v1alpha1
    kind: DPUStoragePolicy
    metadata:
      name: policy-block
      namespace: dpf-operator-system
    spec:
      dpuStorageVendors:
      - spdk-csi
      selectionAlgorithm: "NumberVolumes"
      parameters: {}

Validate that the DPUStorageVendor and DPUStoragePolicy objects are ready:

    kubectl wait --for=condition=Ready --namespace dpf-operator-system dpustoragevendors --all
    kubectl wait --for=condition=Ready --namespace dpf-operator-system dpustoragepolicies --all





Deploy storage test pods that request a block volume from SNAP NVMe.

    kubectl apply -f manifests/08.1-storage-test-nvme

Storage functionality can be tested by verifying the block device is attached inside the pod and performing I/O.
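A basic I/O check on the attached block device might look like the sketch below. The namespace, pod name and device path are placeholders that must be taken from the test manifests; the helper `block_io_check` is hypothetical. It overwrites the first 16 MiB of the device with zeros, so it should only be pointed at a scratch volume.

```shell
# Write 16 MiB of zeros to <device> inside <pod>, then read them back.
# WARNING: destructive for any data already on the device.
block_io_check() {
    namespace=$1; pod=$2; device=$3
    kubectl exec --namespace "$namespace" "$pod" -- \
        dd if=/dev/zero of="$device" bs=1048576 count=16 || return 1
    kubectl exec --namespace "$namespace" "$pod" -- \
        dd if="$device" of=/dev/null bs=1048576 count=16
}

# Example (all three arguments are hypothetical):
#   block_io_check <namespace> <test-pod> <device-path-inside-pod>
```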

    cat manifests/07.2-storage-configuration-virtiofs/*.yaml | envsubst | kubectl apply -f -

This will create the following objects:

DPUStorageVendor for NFS CSI

    ---
    apiVersion: storage.dpu.nvidia.com/v1alpha1
    kind: DPUStorageVendor
    metadata:
      name: nfs-csi
      namespace: dpf-operator-system
    spec:
      storageClassName: nfs-csi
      pluginName: nvidia-fs

DPUStoragePolicy for filesystem policy

    ---
    apiVersion: storage.dpu.nvidia.com/v1alpha1
    kind: DPUStoragePolicy
    metadata:
      name: policy-fs
      namespace: dpf-operator-system
    spec:
      dpuStorageVendors:
      - nfs-csi
      selectionAlgorithm: "NumberVolumes"
      parameters: {}

Validate that the DPUStorageVendor and DPUStoragePolicy objects are ready:

    kubectl wait --for=condition=Ready --namespace dpf-operator-system dpustoragevendors --all
    kubectl wait --for=condition=Ready --namespace dpf-operator-system dpustoragepolicies --all

Deploy storage test pods that mount a storage volume provided by SNAP VirtioFS.

    kubectl apply -f manifests/08.2-storage-test-virtiofs

Storage functionality can be tested by writing and reading data to the mounted volume.
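The write and read-back check can be sketched as a small helper. The namespace, pod name and mount path are placeholders to be taken from the test manifests, and `fs_roundtrip_check` is a hypothetical name; the sketch assumes the pod image provides `sh` and `cat`.

```shell
# Write a marker file to <mount_path> inside <pod>, read it back, and
# succeed only if the content matches what was written.
fs_roundtrip_check() {
    namespace=$1; pod=$2; mount_path=$3
    marker="dpf-virtiofs-test-$$"
    kubectl exec --namespace "$namespace" "$pod" -- \
        sh -c "echo $marker > $mount_path/.dpf-rw-test" || return 1
    readback=$(kubectl exec --namespace "$namespace" "$pod" -- \
        cat "$mount_path/.dpf-rw-test") || return 1
    [ "$readback" = "$marker" ]
}

# Example (all three arguments are hypothetical):
#   fs_roundtrip_check <namespace> <test-pod> <mount-path-inside-pod>
```

The marker file `.dpf-rw-test` is left in place so the same content can also be checked from the NFS server side if desired.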