DOCA Platform Framework v25.4.0
This is the GA release of the DOCA Platform Framework (DPF). It includes bug fixes and improvements to enhance the provisioning and orchestration of NVIDIA BlueField DPUs in Kubernetes environments.
(Beta) Using Redfish to provision DPUs connected to the management network with the Out-Of-Band interface
(Beta) Provisioning of DPUs in servers that are not Kubernetes workers
(Beta) Graceful upgrade of BFB and DPUServices via the DPUDeployment Custom Resource
Ability to attach DPUServices to a user defined network via the DPUServiceNAD Custom Resource
Detailed information about the fixed issues can be found in the release notes for v25.1.1.
DPUService stuck in its Deleting phase
Traffic loss after reconfiguration of DPUServices with chain between
DPU Provisioning operations wouldn’t be retried
[OVN-Kubernetes DPUService] Only a single OVN-Kubernetes DPU service version can be deployed across the cluster
[HBN DPUService] HBN DPUService cannot dynamically reload configurations
[DTS DPUService] DTS appears as OutOfSync
OVN-Kubernetes
-
Supports NVMe-oF protocol with TCP and RDMA transport
Supports Kubernetes PersistentVolume with volumeMode: Block
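For reference, block-mode volumes are consumed through a standard Kubernetes PersistentVolumeClaim such as the minimal sketch below; the storage class name and size are placeholders rather than values prescribed by DPF.
```yaml
# Minimal sketch of a block-mode PVC; storageClassName and size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block
  resources:
    requests:
      storage: 10Gi
  storageClassName: example-storage-class
```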
Hardware and Software Requirements
Refer to Prerequisites for detailed requirements, and note the additional requirements for deploying via Redfish.
DPU Hardware: NVIDIA BlueField-3 DPUs
Minimal DOCA BFB Image: DOCA v2.5 or higher (must be pre-installed on DPUs to support DPF provisioning)
Minimal BMC DPU Firmware Version: >= 24.10 (see Redfish documentation on how to upgrade)
Provisioned DOCA BFB Image:
bf-bundle-3.0.0-135
-
On initial deployment, DPF CRs may be stuck in their initial state (Pending, Initializing, etc.) and not progress
DPF CRs must be created after the DPF components have been deployed. CRs created before the DPF components are running may remain stuck in their initial state.
Internal Ref #4241297
Workaround: Delete any CRs that were created before the System components have been deployed and recreate them.
-
Incompatible DPUFlavor can cause the DPU to get into an unstable state
Using an incompatible DPUFlavor can put the DPU device into an error state that requires manual intervention, for example, allocating 14GB of hugepages on a DPU with 16GB of memory.
Internal Ref #4200717
Workaround: Manually provision the DPU or follow the DOCA troubleshooting documentation (https://docs.nvidia.com/networking/display/bfswtroubleshooting) to return the DPU to an operational state.
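As a sizing reference, the sketch below shows the relevant portion of a DPUFlavor with a hugepage allocation that fits within the DPU's memory; the flavor name and the exact values are illustrative assumptions, and the field layout should be verified against the DPUFlavor CRD shipped with your DPF release.
```yaml
# Illustrative DPUFlavor fragment (assumed schema; verify against your DPF release).
# 3072 x 2MB hugepages = 6GB, leaving ample headroom on a 16GB BlueField-3.
apiVersion: provisioning.dpu.nvidia.com/v1alpha1   # assumed API group/version
kind: DPUFlavor
metadata:
  name: example-flavor            # hypothetical name
  namespace: dpf-operator-system
spec:
  grub:
    kernelParameters:
      - hugepagesz=2048kB
      - hugepages=3072
```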
-
Stale ports after DPU reboot
When a DPU is rebooted, the old DPU service ports are not deleted from the DPU's OVS and remain stale.
Internal Ref #4174183
Workaround: No workaround, known issue; it should not affect performance.
-
BFB filename must be unique
If the bfb.spec.filename of BFB CR#1 is the same as that of BFB CR#2 but references a different URL (the actual .bfb file to download), then BFB CR#1 will reference the wrong .bfb file.
Internal Ref #4143309
Workaround: Use a unique bfb.spec.filename when creating new BFB CRs.
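The hedged sketch below illustrates the intent of the workaround with two BFB CRs that point at different URLs and use distinct filenames; the names and URLs are placeholders, and the exact field casing (referred to above as bfb.spec.filename) should be checked against the BFB CRD in your DPF release.
```yaml
# Illustrative only: two BFB CRs with distinct filenames so each downloads its own .bfb.
# API version, names, and URLs are placeholders; verify field names against your BFB CRD.
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: BFB
metadata:
  name: bfb-a                      # hypothetical
  namespace: dpf-operator-system
spec:
  fileName: bf-bundle-a.bfb        # unique per CR
  url: http://example.com/bfbs/bf-bundle-a.bfb
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: BFB
metadata:
  name: bfb-b                      # hypothetical
  namespace: dpf-operator-system
spec:
  fileName: bf-bundle-b.bfb        # unique per CR
  url: http://example.com/bfbs/bf-bundle-b.bfb
```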
-
DPU Cluster control-plane connectivity is lost when physical port P0 is down on the worker node
If the p0 port link on the DPU goes down, DPU components lose control-plane connectivity.
Internal Ref #3751863
Workaround: Make sure the P0 link is up on the DPU. If it is down, either restart the DPU or refer to the DOCA troubleshooting documentation: https://docs.nvidia.com/networking/display/bfswtroubleshooting.
-
Cluster MTU value cannot be dynamically changed
It is possible to deploy a DPF cluster with a custom MTU value; however, once deployed, the MTU value cannot be modified because it is applied across multiple distributed components.
Internal Ref #3917006
Workaround: Uninstall DPF and reinstall it from scratch with the new MTU value.
-
nvidia-k8s-ipam and servicechainset-controller DPF system DPUServices are in “Pending” phase
As long as there are no provisioned DPUs in the system, the nvidia-k8s-ipam and servicechainset-controller will appear as not ready / pending when querying dpuservices. This has no impact on performance or functionality since DPF system components are only relevant when there are DPUs to provision services on.
Internal Ref #4241324
Workaround: No workaround, known issue.
-
System doesn't recover after DPU Reset
When the user triggers a reset of a DPU in any way other than using DPF APIs (e.g. recreation of a DPU CR), the system may not recover.
Internal Ref #4424305, #4424235, #4188044
Workaround: Power cycle the host. Note that this operation is dangerous and there might be file system corruption which will require triggering the reprovisioning of the DPU.
-
DPUDeployment disruptive upgrade handling is not always graceful
When a DPUDeployment triggers a disruptive upgrade (the default behavior), there are cases in which the upgrade does not trigger the related nodeEffect. This can happen when only the DPUService-related fields were changed.
Internal Ref #4252072
Workaround: No workaround, known issue.
-
DPUDeployment spec.dpuSets.nodeEffect changes require recreation of the CR
If the user changes the spec.dpuSets.nodeEffect field on an already applied DPUDeployment, the reconciliation of the resource will not succeed.
Internal Ref #4410218
Workaround: Recreate the DPUDeployment, which will trigger workload disruption and reprovisioning of the DPUs.
-
Leftover CRs if worker is removed from the cluster permanently
When a worker was added to the cluster, optionally had a DPU provisioned, and was later removed from the host cluster permanently, there may be leftover DPF-related CRs in both the host cluster and the DPU cluster.
Internal Ref #4403130, #4426516
Workaround: No workaround, known issue.
-
DPUSet and BFB removal leaves DPU CRs in OS Installing Phase indefinitely
When a DPUSet that owns DPU CRs in the OS Installing phase is deleted together with the BFB that the DPUSet references, the DPUs are stuck in the OS Installing phase indefinitely and cannot be removed. This can also occur as a race condition when deleting the DPFOperatorConfig CR while there are DPUs in the OS Installing phase.
Internal Ref #4426349
Workaround: No workaround, known issue.
-
DPUDeployment with duplicate nameSuffix in dpuSets causes DPF to ignore the first entry and triggers endless reprovisioning
When defining multiple dpuSets in a single DPUDeployment, if two entries use the same nameSuffix but different nodeSelectors, the system only respects the last defined dpuSet. The earlier one is ignored without any validation error. As a result, DPF attempts to provision DPUs that do not match the second dpuSet's nodeSelector, causing them to get stuck in an infinite reprovisioning loop.
Internal Ref #4413138
Workaround: Ensure the nameSuffix is unique across all entries in the dpuSets list.
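The fragment below sketches how the dpuSets list might look with unique nameSuffix values; the node labels are hypothetical and the rest of the DPUDeployment spec is omitted.
```yaml
# Illustrative dpuSets fragment of a DPUDeployment: each entry has a unique nameSuffix.
# Node labels are hypothetical; the surrounding DPUDeployment fields are omitted.
dpuSets:
  - nameSuffix: rack-a
    nodeSelector:
      matchLabels:
        example.com/rack: "a"
  - nameSuffix: rack-b
    nodeSelector:
      matchLabels:
        example.com/rack: "b"
```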
-
Deleting ArgoCD pods on a cluster with workers causes pods to be created on worker nodes
When deleting the ArgoCD pods, the new ones might end up on a worker node; if the OVN Kubernetes VF injector is installed, this may lead to a deadlock where DPUServices are not reconciled.
Internal Ref #4401382
Workaround:
1) Cordon all the worker nodes
```bash
kubectl cordon -l node-role.kubernetes.io/control-plane!=
```
2) (Optional) Delete the MutatingWebhookConfiguration related to the OVN Kubernetes VF Injector
```bash
kubectl delete mutatingwebhookconfiguration ovn-kubernetes-resource-injector
```
3) Delete the ArgoCD pods
```bash
kubectl delete pod -n dpf-operator-system -l app.kubernetes.io/part-of=argocd
```
4) Uncordon all the worker nodes
```bash
kubectl uncordon -l node-role.kubernetes.io/control-plane!=
```
5) (Optional) Reinstall the OVN Kubernetes VF Injector as instructed in the OVN related guides
-
DPUDeployment or DPUSet created in a namespace different from dpf-operator-system does not trigger DPU provisioning
When creating a DPUDeployment or a DPUSet in a namespace other than dpf-operator-system, no DPU CRs are created because the DPUNode CRs reside in the dpf-operator-system namespace.
Internal Ref #4427091
Workaround: Create the DPUDeployment or DPUSet in the dpf-operator-system namespace.
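A minimal sketch of the expected placement; the API group/version and the resource name are assumptions and should be checked against the CRDs installed with your DPF release.
```yaml
# Minimal sketch: create the DPUDeployment in the dpf-operator-system namespace.
# The apiVersion and name are assumptions; verify against your installed CRDs.
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUDeployment
metadata:
  name: example-dpudeployment
  namespace: dpf-operator-system
spec:
  # dpus, services, and serviceChains as documented in the DPUDeployment guides
```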
-
Service Function Chain (SFC) is configured only when all chain components are installed and correct
When creating a service chain with more than 2 endpoints, introducing a bad configuration in one of the involved DPUServiceInterfaces or making one of the DPUServices unavailable prevents the relevant controller from properly configuring the chain. This means that a bad configuration anywhere in the chain breaks the entire chain, affecting all services that use it for traffic.
Internal Ref #4367518
Workaround: No workaround, known issue.
-
DPUService stops reconciling when the DPUCluster is unavailable for a long time
When the DPUCluster is unavailable for a long time (more than 5 minutes), changes to DPUServices (including those generated via DPUDeployment or DPFOperatorConfig) made during that time might not be reflected in the DPUCluster.
Internal Ref #4359857
Workaround: Recreate DPUServices that are stuck.
-
dmsinit.sh fails to fetch the newly created certificate secret on the first run for non-Kubernetes workers
When provisioning DPUs on servers that are not Kubernetes workers, the dmsinit.sh script specified in the documentation may fail on the first run because cert-manager updates the Certificate spec.secretName before the secret is created in the API server.
Internal Ref #4445638
Workaround: Rerun the dmsinit.sh script.
-
Instability with DPU provisioning when NFS service is not available
When the NFS service is unavailable, the following behaviors may occur: 1) Deleting a Ready DPU CR does not succeed until the NFS service is restored, and no error message is shown to the user. 2) A DPU CR in the OS Installing phase hangs without any error, and DPU provisioning does not proceed.
Internal Ref #4123204
Workaround: Ensure the NFS service is available. For issue #2, delete the affected DPU CR and wait for a new CR to be created. Note that deleting the DPU CR in this phase may take up to 30 minutes.
-
DPUDeployment should specify supported spec.dpuSets.nodeEffect when deploying on DPUs that are part of servers that are not Kubernetes workers
When provisioning DPUs on servers that are not Kubernetes workers, the DPUDeployment must specify spec.dpuSets.nodeEffect, and the value must be one that is supported in this configuration. See the Node Effects guide for more information.
Internal Ref #4445649
Workaround: Specify the correct node effect by recreating the DPUDeployment.
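A hedged fragment of what the dpuSets entry might look like; noEffect is shown purely as an illustration of a non-disruptive node effect, and the Node Effects guide remains the authority on which values are supported for servers that are not Kubernetes workers.
```yaml
# Illustrative dpuSets fragment with an explicit nodeEffect for non-Kubernetes workers.
# "noEffect" is an example only; consult the Node Effects guide for supported values.
dpuSets:
  - nameSuffix: standalone
    nodeSelector:
      matchLabels:
        example.com/standalone-dpu-host: "true"   # hypothetical label
    nodeEffect:
      noEffect: true
```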
-
[OVN-Kubernetes DPUService] Nodes marked as NotReady
When installing OVN-Kubernetes as a CNI on a node running containerd version 1.7.0 or above, the Node never becomes ready.
Internal Ref #4178221
Workaround: Option 1: Use a containerd version below v1.7.0 when using OVN-Kubernetes as the primary CNI. Option 2: Manually restart containerd on the host.
-
[OVN-Kubernetes DPUService] control plane node is not functional after reboot or network restart
During OVN-Kubernetes CNI installation on the control plane nodes, the management interface is moved, together with its IP, into a newly created OVS bridge. Since this network configuration is not persistent, it is lost on node or network restart.
Internal Ref #4241306
Workaround:
1) Pre-define the OVS bridge on each control plane node with the OOB port MAC and IP address and ensure it gets a persistent IP
```yaml
# Ubuntu example for netplan persistent network configuration:
network:
  ethernets:
    oob:
      match:
        # the mac address of the oob
        macaddress: xx:xx:xx:xx:xx:xx
      set-name: oob
  bridges:
    br0:
      addresses: [x.x.x.x/x]
      interfaces: [oob]
      # the mac address of the oob
      macaddress: xx:xx:xx:xx:xx:xx
      openvswitch: {}
  version: 2
```
2) Set OVS bridge "bridge-uplink" in OVS metadata
```bash
ovs-vsctl br-set-external-id br0 bridge-id br0 -- br-set-external-id br0 bridge-uplink oob
```
-
[OVN-Kubernetes DPUService] Lost traffic from workloads to control plane components or K8S services after DPU reboot, port flapping, OVS restart or manual network configuration
Connectivity issues between workload pods and control plane components or K8S services may occur after the following events: DPU reboot without a host reboot, high-speed port flapping (link down/up), OVS restart, or a DPU network configuration change (for example, running "netplan apply" on the DPU). These events cause the network configuration applied by the OVN CNI on the DPU to be lost, and it is not reapplied automatically.
Internal Ref #4424305, #4188044, #4424235
Workaround: Recreate the OVN-Kubernetes node pod on the host to reapply the configuration.
-
[OVN-Kubernetes DPUService] Host network configuration changes may result in lost traffic from host workloads (on the overlay)
When changing the host network (for example with netplan apply), custom network configuration applied by the host CNI components may be lost and is not reapplied automatically.
Internal Ref #4188044
Workaround: Recreate the OVN-Kubernetes node pod on the host to reapply the configuration.
-
[OVN-Kubernetes DPUService] InternalTrafficPolicy and ExternalTrafficPolicy Service options are not handled correctly
When external traffic reaches a NodePort Kubernetes Service on the host cluster via the OOB network of the host, the relevant policy option does not work as expected. The same applies to traffic originating from a Pod on the overlay network hitting the same type of Service.
Internal Ref #4320953
Workaround: No workaround, known issue.
-
[OVN-Kubernetes DPUService] No network connectivity for SR-IOV accelerated workload pods after DPU reboot
SR-IOV accelerated workload pods lose their VF interface upon DPU reboot. The VF is available on the host but is not injected back into the pod.
Internal Ref #4236521
Workaround: Recreate the SR-IOV accelerated workload pods.
-
[OVN-Kubernetes DPUService] SR-IOV test Pod cannot reach Kubernetes API Service
When running an SR-IOV test Pod, the pod cannot reach the Kubernetes API Service. The issue is that the related conntrack entries sometimes miss the un-NAT.
Internal Ref #4313629
Workaround: To unblock, run the following on the DPUs:
```bash
ovs-appctl revalidator/purge
ovs-appctl dpctl/flush-conntrack
```
-
[HBN DPUService] Invalid HBN configuration is not reflected to user in case it is syntactically valid
If the HBN YAML configuration is syntactically valid but contains values that are illegal from an NVUE perspective, the HBN service starts with the last known valid configuration, and this is not reflected to the end user.
Internal Ref #4172029
Workaround: No workaround, known issue.
-
[HBN + OVN-Kubernetes DPUServices] HBN service restarts on DPU causes worker to lose traffic
If the HBN pod on the DPU resets, the workloads on the host (any traffic on the OVN overlay) will not receive traffic.
Internal Ref #4220185, #4223176
Workaround: Wait 15 minutes for the system to recover or restart the OVN-Kubernetes Pod on that particular DPU.
-
[Firefly DPUService] Deletion of the Firefly DPUService leaves stale flows in OVS
When the Firefly DPUService is deleted after successful deployment or the labels of the serviceDaemonSet are modified, flows are not cleaned up from OVS.
Internal Ref #4382535
Workaround: Although these flows are very unlikely to cause an issue, reprovisioning the DPU or power cycling the host will bring OVS back to a good state. Note that power cycling the host is dangerous and there might be file system corruption, which will require triggering reprovisioning of the DPU.
-
[OVN-Kubernetes DPUService] Kubernetes Services sporadically can not be reached after system reboots
After a node reboots, there is a small chance that workload pods cannot reach Kubernetes Service addresses (e.g. the Kubernetes API server). This is because the OVS ARP cache of tunnel endpoints is not correctly populated on the DPU.
Internal Ref #4448292, #4448355
Workaround: Generate ICMP traffic from the affected node through the Kubernetes pod network. For example, if the cluster was created with the pod network POD_CIDR=10.233.64.0/18, each node gets a /24 slice; from the affected node, trigger ICMP traffic (e.g. using the ping utility) to the ovn-k8s-mp0 interface on one of the control plane nodes (e.g. ping 10.233.64.2).