DOCA Platform Framework v25.10.0
This is the GA release of the DOCA Platform Framework (DPF). It includes bug fixes and improvements to enhance the provisioning and orchestration of NVIDIA BlueField DPUs in Kubernetes environments.
Multi-DPU Support:
Enhancements to support multiple DPUs throughout DPF components
Support provisioning multiple DPUs on the same host with node effect and reboot synchronization
Networking:
RDMA CNI support for DPU workloads
DPF Operator:
Support for resource configuration of DPF components
Support for DPUDeployment disruptive upgrades:
Support for triggering node effect on update of disruptive standard or in-cluster DPUServices and DPUServiceChains
Support for removing the node effect only after all the DPUServices have Pods that are Ready
Support for removing the node effect only after all the DPUServiceChains have ServiceChains that are Ready
Provisioning Enhancements:
New Host Agent component to handle host operations
Upgrade Support:
DPF now supports upgrades from previous GA releases
DPF Storage Enhancements:
Support for VirtioFS in Host Trusted and Zero Trust modes
Support for static PF PCI emulated functions
Support for multiple DPUs on the same host
OVN VPC DPUService:
(tech preview) OVN HA support
Zero downtime upgrade guide
dpfctl:
Enhancements to reflect the status of additional DPF objects such as storage, DPUNode, and DPUDevice
The following issues have been fixed in this release. Detailed information about these issues can be found in the release notes for v25.7.0.
DPUSet and BFB removal leaves DPU CRs in OS Installing Phase indefinitely
DPUFlavor nvconfig.hostPowerCycleRequired is not used
[VPC OVN DPUService] E/W Traffic Not Passing within Same VPC after DPU Reprovisioning
DPUServiceInterface doesn't wait for ServiceInterface to be deleted
DPUDeployment disruptive upgrade handling is not always graceful
Hardware and Software Requirements
For Host Trusted mode: Refer to Host Trusted Prerequisites for detailed requirements.
For Zero Trust mode: Refer to Zero Trust Prerequisites for detailed requirements. Also note the additional requirements for deploying via Redfish.
DPU Hardware: NVIDIA BlueField-3 DPUs
Minimal DOCA BFB Image: DOCA v2.5 or higher (must be pre-installed on DPUs to support DPF provisioning)
Supported DOCA BFB Image:
bf-bundle-3.2.1
Installation Notes
None.
Upgrade Notes
DPF supports upgrades from the immediate previous GA release. For more information, refer to Lifecycle Management.
[Zero-Trust] DpuDiscovery no longer creates DPUNodes:
Description: DpuDiscovery no longer creates DPUNodes by default. A site administrator can re-enable DPUNode creation by setting skipDPUNodeDiscovery to false in the DPFOperatorConfig.
Example: Set skipDPUNodeDiscovery to false in DPFOperatorConfig:
apiVersion: operator.dpu.nvidia.com/v1alpha1
kind: DPFOperatorConfig
metadata:
  name: dpfoperatorconfig
  namespace: dpf-operator-system
spec:
  provisioningController:
    installInterface:
      installViaRedfish:
        skipDPUNodeDiscovery: false
Required field changes in DPUSet:
Description: The following fields in DPUSet are now required and must be specified:
spec.dpuTemplate is now required
spec.dpuTemplate.spec.dpuFlavor is now required
Impact: Existing DPUSet resources that do not specify these fields will need to be updated before upgrading.
Note: These fields were made required because DPUSet resources cannot function properly without them. This change enforces validation that should have been in place from the beginning, making this effectively a bug fix rather than a breaking change for properly configured resources.
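Example (illustrative sketch of a DPUSet that satisfies the new requirements; the provisioning.dpu.nvidia.com/v1alpha1 API group and the bfb reference field are assumptions, and the resource names are placeholders):
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUSet
metadata:
  name: dpuset-example
  namespace: dpf-operator-system
spec:
  dpuTemplate:                        # now required
    spec:
      dpuFlavor: dpuflavor-example    # now required; must reference an existing DPUFlavor
      bfb:
        name: bf-bundle               # assumed field; references the BFB to install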
MTU minimum value changes:
Description: The minimum MTU value has been increased from 1000 to 1280 bytes across multiple API fields. Additionally, maximum MTU constraints have been added where applicable.
Affected Fields:
DPUServiceChain.spec.template.spec.template.spec.switches[*].serviceMTU - minimum increased: 1000 → 1280
DPUDeployment.spec.serviceChains.switches[*].serviceMTU - minimum increased: 1000 → 1280
ServiceChain.spec.switches[*].serviceMTU - minimum increased: 1000 → 1280
ServiceChainSet.spec.template.spec.switches[*].serviceMTU - minimum increased: 1000 → 1280
DPFOperatorConfig.spec.networking.controlPlaneMTU - minimum increased: 1000 → 1280
DPFOperatorConfig.spec.networking.highSpeedMTU - minimum increased: 1000 → 1280
DPUFlavor.spec.hostNetworkInterfaceConfigs[*].mtu - minimum increased: 1000 → 1280
DPUServiceNAD.spec.serviceMTU - minimum constraint added: 1280; maximum constraint added: 9216
Impact: Existing resources with MTU values below 1280 will need to be updated. The system will reject values below the new minimum during validation.
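Example (illustrative sketch of a DPFOperatorConfig networking section that passes the new validation; the field paths come from the list above, the MTU values are placeholders, and other required configuration is omitted for brevity):
apiVersion: operator.dpu.nvidia.com/v1alpha1
kind: DPFOperatorConfig
metadata:
  name: dpfoperatorconfig
  namespace: dpf-operator-system
spec:
  networking:
    controlPlaneMTU: 1500   # must be >= 1280
    highSpeedMTU: 9000      # must be >= 1280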
DPUDeployment now requires at least one and up to 50 services:
Description: Previously, it was possible to define a DPUDeployment without any services. This was a bug in the validation, which is now fixed. DPUDeployment now requires at least one and up to 50 services.
Affected Fields:
DPUDeployment.spec.services - minimum constraint added where there was none previously: 1
DPUDeployment.spec.services - maximum constraint added where there was none previously: 50
Impact: Existing DPUDeployments without services should be replaced with a DPUSet instead. The system will reject creation or update of a DPUDeployment with no services or with more than 50 services.
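Example (illustrative sketch of a DPUDeployment with a single service entry; the svc.dpu.nvidia.com/v1alpha1 API group, the dpus section, and the serviceTemplate/serviceConfiguration reference fields are assumptions, and all names are placeholders):
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUDeployment
metadata:
  name: dpudeployment-example
  namespace: dpf-operator-system
spec:
  dpus:
    bfb: bf-bundle                # assumed reference to a BFB resource
    flavor: dpuflavor-example     # assumed reference to a DPUFlavor resource
  services:                       # must contain between 1 and 50 entries
    my-service:
      serviceTemplate: my-service-template            # placeholder DPUServiceTemplate name
      serviceConfiguration: my-service-configuration  # placeholder DPUServiceConfiguration name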
Deprecated API Fields
The following API fields have been deprecated and will be removed in future releases. Users should migrate to the recommended alternatives.
DPFOperatorConfig Fields:
Description: Multiple fields in DPFOperatorConfig have been deprecated.
The top-level image fields have been replaced with container-specific image override fields for more granular control.
Registry configuration fields under installViaRedfish have been deprecated in favor of the top-level registry field.
The deployInTargetCluster field is deprecated and not supported.
Deprecated Fields:
Image fields (replaced with container-specific overrides):
spec.provisioningController.image → Use spec.provisioningController.controller.image instead
spec.dpuServiceController.image → Use spec.dpuServiceController.controller.image instead
spec.dpuDetector.image → Use spec.dpuDetector.daemon.image instead
spec.kamajiClusterManager.image → Use spec.kamajiClusterManager.controller.image instead
spec.staticClusterManager.image → Use spec.staticClusterManager.controller.image instead
spec.serviceSetController.image → Use spec.serviceSetController.controller.image instead
spec.flannel.image → Use spec.flannel.cni.image and spec.flannel.daemon.image instead
spec.multus.image → Use spec.multus.cni.image instead
spec.nvipam.image → Use spec.nvipam.controller.image instead
spec.sriovDevicePlugin.image → Use spec.sriovDevicePlugin.deviceplugin.image instead
spec.ovsCNI.image → Use spec.ovsCNI.cni.image instead
spec.sfcController.image → Use spec.sfcController.controller.image instead
ProvisioningController registry fields:
spec.provisioningController.installInterface.installViaRedfish.bfbRegistryAddress → Use spec.provisioningController.registry.address instead
spec.provisioningController.installInterface.installViaRedfish.bfbRegistry → Use spec.provisioningController.registry instead
DeployInTargetCluster fields (not supported):
spec.serviceSetController.deployInTargetCluster → This field is not supported and will be removed
spec.flannel.deployInTargetCluster → This field is not supported and will be removed
spec.multus.deployInTargetCluster → This field is not supported and will be removed
spec.nvipam.deployInTargetCluster → This field is not supported and will be removed
spec.sriovDevicePlugin.deployInTargetCluster → This field is not supported and will be removed
spec.ovsCNI.deployInTargetCluster → This field is not supported and will be removed
spec.sfcController.deployInTargetCluster → This field is not supported and will be removed
spec.cniInstaller.deployInTargetCluster → This field is not supported and will be removed
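Example (illustrative before/after sketch of one image override migration inside a DPFOperatorConfig spec; the image reference is a placeholder):
# Deprecated form:
spec:
  provisioningController:
    image: example.com/dpf/provisioning-controller:v25.10.0
# Recommended form:
spec:
  provisioningController:
    controller:
      image: example.com/dpf/provisioning-controller:v25.10.0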
DPUDevice Fields:
Description: The following fields in DPUDevice.spec have been deprecated in favor of their corresponding status fields, which are automatically populated by the system.
Deprecated Fields:
spec.psid → Use status.psid instead
spec.opn → Use status.opn instead
spec.pf0Name → Use status.pf0Name instead
Note: The status fields are considered read-only and automatically populated by the system. Users should not attempt to modify these fields.
DPUNode Fields:
Description: The gNOI reboot method and the nodeDMSAddress field have been deprecated.
Deprecated Fields:
spec.nodeRebootMethod.gNOI → Use spec.nodeRebootMethod.hostAgent instead
spec.nodeDMSAddress → This field is no longer used
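Example (illustrative sketch of a DPUNode using the hostAgent reboot method; the provisioning.dpu.nvidia.com/v1alpha1 API group and the empty-object selector form are assumptions, and the node name is a placeholder):
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUNode
metadata:
  name: worker-1
  namespace: dpf-operator-system
spec:
  nodeRebootMethod:
    hostAgent: {}   # replaces the deprecated gNOI reboot method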
DPUSet Fields:
Description: The maxUnavailable field in the rolling update strategy has been deprecated.
Deprecated Fields:
spec.strategy.rollingUpdate.maxUnavailable → Use dpfoperatorconfig.spec.provisioningController.maxUnavailableDPUNodes instead
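Example (illustrative sketch of the replacement field on the DPFOperatorConfig; the value is a placeholder and other required configuration is omitted for brevity):
apiVersion: operator.dpu.nvidia.com/v1alpha1
kind: DPFOperatorConfig
metadata:
  name: dpfoperatorconfig
  namespace: dpf-operator-system
spec:
  provisioningController:
    maxUnavailableDPUNodes: 1   # replaces DPUSet spec.strategy.rollingUpdate.maxUnavailable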
DPUFlavor Fields:
Description: The hostPowerCycleRequired field in NVConfig has been deprecated as it is unused.
Deprecated Fields:
spec.nvconfig[].hostPowerCycleRequired → This field is unused and deprecated
DPU Fields:
Description: The bmcIP field in the DPU spec has been deprecated; the BMC IP should be obtained from the associated DPUDevice instead.
Deprecated Fields:
spec.bmcIP → Use DPUDevice.spec.bmcIP or DPUDevice.status.bmcIp instead
DPUServiceNAD Fields:
Description: The spec.metadata field in DPUServiceNAD has been deprecated. The standard Kubernetes metadata field at the resource level remains available.
Deprecated Fields:
spec.metadata → This field is unused and deprecated
Known Issues
-
On initial deployment, DPF CRs may be stuck in an initial state (Pending, Initializing, etc.) and not progress
If DPF CRs are created before the DPF components are running, they may remain stuck in their initial state. DPF CRs must be created after the DPF components have been deployed.
Internal Ref #4241297
Workaround: Delete any CRs that were created before the System components have been deployed and recreate them.
-
Stale ports after DPU reboot
When a DPU is rebooted, the old DPU service ports are not deleted from the DPU's OVS and remain stale.
Internal Ref #4174183
Workaround: No workaround, known issue. It should not affect performance.
-
DPU Cluster control-plane connectivity is lost when physical port P0 is down on the worker node
A link down of the p0 port on the DPU results in loss of DPU control plane connectivity for DPU components.
Internal Ref #3751863
Workaround: Make sure the P0 link is up on the DPU. If it is down, either restart the DPU or refer to the DOCA troubleshooting guide: https://docs.nvidia.com/networking/display/bfswtroubleshooting.
Note: This issue is relevant for Host Trusted deployments only.
-
nvidia-k8s-ipam and servicechainset-controller DPF system DPUServices are in “Pending” phase
As long as there are no provisioned DPUs in the system, the nvidia-k8s-ipam and servicechainset-controller will appear as not ready / pending when querying dpuservices. This has no impact on performance or functionality since DPF system components are only relevant when there are DPUs to provision services on.
Internal Ref #4241324
Workaround: No workaround, known issue.
-
System doesn't recover after DPU Reset
When the user triggers a reset of a DPU in any way other than via the DPF APIs (e.g., other than by recreating the DPU CR), the system may not recover.
Internal Ref #4521178, #4188044, #4732664
Workaround: Power cycle the host. Note that this operation is dangerous and there might be file system corruption which will require triggering the reprovisioning of the DPU. If the system does not recover, reprovision the DPU by deleting the DPU CR (it will be recreated and DPU provisioning will happen).
Note: This issue is relevant for Host Trusted deployments where OVN-Kubernetes is used as a primary CNI.
-
Leftover CRs if worker is removed from the cluster permanently
When a worker node that was added to the cluster (and optionally had a DPU provisioned) is later removed from the host cluster permanently, leftover DPF-related CRs may remain in both the host cluster and the DPU cluster.
Internal Ref #4403130, #4571788
Workaround: No workaround, known issue.
-
DPUDeployment or DPUSet created in namespace different from the dpf-operator-system namespace do not trigger DPU provisioning
When creating a DPUDeployment or a DPUSet in a namespace other than dpf-operator-system, no DPU CRs are created because the DPUNode CRs reside in the dpf-operator-system namespace.
Internal Ref #4427091
Workaround: Create the DPUDeployment or DPUSet in the dpf-operator-system namespace.
-
DPUService stops reconciling when DPUCluster is unavailable for long time
When the DPUCluster is unavailable for a long time (more than 5 minutes), changes to DPUServices (including those generated via DPUDeployment or DPFOperatorConfig) that occurred during that time might not be reflected in the DPUCluster.
Internal Ref #4359857
Workaround: Recreate DPUServices that are stuck.
-
DPUFlavor nvconfig limitations
DPUFlavor does not support assigning nvconfig parameters to specific devices only. Currently, it assigns the same parameters across all devices (i.e., '*').
Internal Ref #4583584
-
[OVN-Kubernetes DPUService] Nodes marked as NotReady
When installing OVN-Kubernetes as a CNI on a node running containerd version 1.7.0 or above, the Node never becomes Ready.
Internal Ref #4178221
Workaround:
Option 1: Use containerd version below v1.7.0 when using OVN-Kubernetes as a primary CNI.
Option 2: Manually restart containerd on the host.
-
[OVN-Kubernetes DPUService] Lost traffic from workloads to control plane components or Kubernetes services after DPU reboot, port flapping, OVS restart, or manual network configuration
Connectivity issues between workload pods and control plane components or Kubernetes services may occur after the following events: DPU reboot without host reboot, high-speed port flapping (link down/up), OVS restart, or a DPU network configuration change (for example, using the "netplan apply" command on the DPU). The issues are caused by network configuration that was applied by the OVN CNI on the DPUs and is not reapplied automatically.
Internal Ref #4188044, #4521178
Workaround: Recreate the OVN-Kubernetes node pod on the host to reapply the configuration.
-
[OVN-Kubernetes DPUService] Host network configuration changes may result in lost traffic from host workloads (on the overlay)
When changing the host network configuration (for example, with netplan apply), custom network configuration applied by the host CNI components may be lost and is not reapplied automatically.
Internal Ref #4188044
Workaround: Recreate the OVN-Kubernetes node pod on the host to reapply the configuration.
-
[OVN-Kubernetes DPUService] InternalTrafficPolicy and ExternalTrafficPolicy Service options are not handled correctly
When external traffic reaches a NodePort Kubernetes Service on the host cluster via the OOB network of the host, the relevant traffic policy option does not work as expected. The same applies to traffic originating from a Pod on the overlay network reaching the same type of Service.
Internal Ref #4320953
Workaround: No workaround, known issue.
-
[OVN-Kubernetes DPUService] No network connectivity for SR-IOV accelerated workload pods after DPU reboot
An SR-IOV accelerated workload pod loses its VF interface upon DPU reboot. The VF is available on the host but is not injected back into the pod.
Internal Ref #4236521
Workaround: Recreate the SR-IOV accelerated workload pods.
-
[HBN DPUService] Invalid HBN configuration is not reflected to user in case it is syntactically valid
If the HBN YAML configuration is syntactically valid but contains values that are illegal from an NVUE perspective, the HBN service starts with the last known valid configuration, and this is not reflected to the end user.
Internal Ref #4555216
Workaround: No workaround, known issue.
-
[Firefly DPUService] Deletion of the Firefly DPUService leaves stale flows in OVS
When the Firefly DPUService is deleted after successful deployment or the labels of the serviceDaemonSet are modified, flows are not cleaned up from OVS.
Internal Ref #4382535
Workaround: Although these flows are very unlikely to cause an issue, reprovisioning the DPU or power cycling the host will bring OVS back to a good state. Note that power cycling the host is dangerous and may cause file system corruption, which will require triggering the reprovisioning of the DPU.
-
Changing DPUNode nodeRebootMethod during DPU provisioning is not supported
When a DPU is not in the Ready or Error phase, updating the DPUNode spec.nodeRebootMethod is not supported. Updating spec.nodeRebootMethod at this time may cause the DPU to get stuck in the Rebooting phase.
Internal Ref: #4641260
Workaround: Delete the DPU object to trigger reprovisioning.
-
DPU may fail to reboot during Rebooting phase if firmware version is too old
If the firmware version of the DPU is older than version 32.41.1000 (released April 2024), DPUs may fail to provision with an error during the DPU reboot phase.
Internal Ref: #4634335
Workaround: Manually upgrade the firmware version of the DPU to 32.41.1000 or later, or set spec.nodeRebootMethod to external (and reboot the node at the correct time), as sketched below.
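Example (illustrative DPUNode spec fragment for the second workaround; the empty-object selector form for external is an assumption, mirroring the hostAgent method described under Deprecated API Fields):
spec:
  nodeRebootMethod:
    external: {}   # assumed form; the node must then be rebooted externally at the correct time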
-
[DPUDeployment] Pods of service referenced in spec.serviceChain are restarted when serviceChain changes, causing disruption
When updating DPUDeployment.spec.serviceChain with a disruptive upgrade, the affected pods are restarted, but DPUDeployment does not apply the node effect to the corev1.Node in the host cluster. This causes the node to remain marked as ready and to accept workloads before the DPU services are fully operational.
Internal Ref: #4686635
Workaround: No workaround, known issue.
-
[DPUDeployment] Invalid configuration in service and service chain related configuration may lead to DPUDeployment making no progress, even after reverting to a working configuration.
When an invalid change is introduced in DPUServiceConfiguration/DPUServiceTemplate or the DPUDeployment.spec.serviceChain, the DPUDeployment may get stuck. This happens when one or more DPUs get to the Node Effect Removal state. The DPU is stuck because the underlying component doesn't become ready (Pod/ServiceChain). Even reverting to the working configuration won't work because the DPU controller doesn't adjust the labels on the corev1.Node in the DPUCluster when the DPU is in Node Effect Removal state.
Internal Ref: #4713965
Workaround: Revert to a working configuration and remove the requestors from the DPUNodeMaintenance objects.
-
No network traffic through DPU when the system is properly configured and all related DPF resources are ready
In rare cases, a race condition during the initialization of the OVS instance running on the DPU may prevent traffic from passing through the DPU.
Internal Ref: #4719221
Workaround: The following workarounds are possible:
Option 1: Power cycle the host.
Option 2: Add the following command to the DPUFlavor spec.ovs field (see the sketch below): ovs-appctl dpctl/del-dp system@ovs-system
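Example (illustrative DPUFlavor fragment for the second option; the provisioning.dpu.nvidia.com/v1alpha1 API group and the rawConfigScript form under spec.ovs are assumptions, and other flavor fields are omitted for brevity):
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUFlavor
metadata:
  name: dpuflavor-example
  namespace: dpf-operator-system
spec:
  ovs:
    rawConfigScript: |
      # existing OVS configuration for this flavor ...
      ovs-appctl dpctl/del-dp system@ovs-system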
-
DPU provisioning in Zero-Trust may fail if the provisioning controller has restarted during DPU provisioning
If the provisioning controller restarts while a DPU is being provisioned, the DPU will transition to the Error phase. In the provisioning controller logs, a 400 Bad Request error is seen with the message "An update is in progress." while communicating with the DPU BMC.
Internal Ref: #4723795
Workaround: Wait a few minutes for the current in-flight DPU provisioning process to be completed by the DPU BMC, then delete the DPU object to restart provisioning.
-
DOCA SNAP may fail to reattach a DPUVolumeAttachment that uses VirtioFS on hot-plugged PFs after a host power cycle
After a host power cycle, a DPUVolumeAttachment that uses VirtioFS on hot-plugged PFs can become not ready and might not be restored automatically.
Internal Ref: #4734537
Workaround:
Step 1: Delete the failed DPUVolumeAttachment.
Step 2: Create a new DPUVolumeAttachment with the same configuration.
-
DOCA SNAP may fail to reattach a DPUVolumeAttachment that uses emulated NVMe devices on hot-plugged PFs after a host power cycle
After a host power cycle, a DPUVolumeAttachment that uses emulated NVMe devices on hot-plugged PFs can become not ready and might not be restored automatically.
Internal Ref: #4734514
Workaround:
Step 1: Mark the DPUVolumeAttachment for deletion. The object removal will be blocked by the finalizer at this step.
Step 2: Access the API of the DPU cluster and remove the finalizer from the VolumeAttachment CR that has the same name as the DPUVolumeAttachment CR in the host cluster.
Step 3: Wait for the DPUVolumeAttachment to be fully removed in the host cluster.
Step 4: Power cycle the host.
Step 5: After the host is fully booted, create a new DPUVolumeAttachment with the same configuration.
-
In rare cases, a DPUVolumeAttachment that uses emulated NVMe devices on hot-plugged PFs may reach the Ready state but have an invalid (00:00.0) PCI device address in the CR status
If the host is still booting while a DPUVolumeAttachment is in progress, the DPUVolumeAttachment may reach the Ready state but have an invalid (00:00.0) PCI device address, which prevents the volume from being consumed by workloads on the host.
Internal Ref: #4734509
Workaround:
Step 1: Mark the DPUVolumeAttachment for deletion. The object removal will be blocked by the finalizer at this step.
Step 2: Access the API of the DPU cluster and remove the finalizer from the VolumeAttachment CR that has the same name as the DPUVolumeAttachment CR in the host cluster.
Step 3: Wait for the DPUVolumeAttachment to be fully removed in the host cluster.
Step 4: Power cycle the host.
Step 5: After the host is fully booted, create a new DPUVolumeAttachment with the same configuration.
-
Long DPU provisioning time when multiple DPUs are provisioned on the same node
In some cases, it may take a long time to provision multiple DPUs on the same node.
Internal Ref: #4757191
Workaround: If the DPU is stuck in the DPU Cluster Config phase after a long installation, reprovision the DPU and make sure no other DPUs on the same node are being provisioned at the same time.
-
DPF does not support HBN with EVPN configuration together with OVN-Kubernetes or DPF OVN VPC services
Mixing GENEVE (OVN) and VXLAN (EVPN) tunnel types in HBN will cause traffic failures.
Internal Ref: #4776088
Workaround: No workaround. This configuration should be avoided.