DOCA Platform Framework v25.7.0
This is the GA release of the DOCA Platform Framework (DPF). It includes bug fixes and improvements to enhance the provisioning and orchestration of NVIDIA BlueField DPUs in Kubernetes environments.
New DOCA services are now deployable as DPUServices:
DOCA Argus service
DPU Zero Trust is now Beta:
VPC Service support: Enables centralized VPC network creation and configuration on BlueField DPUs with a declarative API.
SNAP based storage volume management: Ability to expose storage resources to the host as emulated NVMe devices or as a VirtioFS filesystem.
Auto-discovery of DPUs: DPU devices can be automatically discovered in the fabric, removing the need to manually create a DPUDevice object with its details for each DPU.
Auto-update of old BMC versions: DPF automatically identifies old BMC versions and updates them to the minimal supported version, removing the need to manually update the BMC from an old version.
Improved DPU Management and Kubernetes Integration:
DPUDeployment service dependency management: DPF now supports service deployment dependencies, allowing dependent service Helm values to be templated with items from dependencies.
Workload prevention on Kubernetes host: Improved synchronization of the host worker node and DPUService life cycles. A Kubernetes host worker node is marked as NotReady until all the required services on the DPU are running.
Detailed information about the fixed issues can be found in the release notes for v25.4.0.
Incompatible DPUFlavor can cause DPU to get into an unstable state
BFB filename must be unique
DPUDeployment spec.dpuSets.nodeEffect changes require recreation of the CR
DPUDeployment with duplicate nameSuffix in dpuSets causes DPF to ignore the first entry and triggers endless reprovisioning
Deleting ArgoCD pods on a cluster with workers causes pods to be created on worker nodes
Service Function Chain (SFC) is configured only when all chain components are installed and correct
dmsinit.sh fails to fetch newly created certificate secret on first run for non-Kubernetes workers
Instability with DPU provisioning when NFS service is not available
DPUDeployment should specify supported spec.dpuSets.nodeEffect when deploying on DPUs that are part of servers that are not Kubernetes workers
[HBN + OVN-Kubernetes DPUServices] HBN service restarts on DPU causes worker to lose traffic
OVN-Kubernetes
-
Supports NVMe and VirtioFS emulation for DPF deployments in non-trusted mode
Supports NVMe emulation for DPF deployments in trusted mode
Supports Kubernetes PersistentVolume with volumeMode: Block for DPF deployments in trusted mode
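As an illustration of the block-mode PersistentVolume support listed above, a minimal generic Kubernetes PVC sketch is shown below; the claim name, storage class, and size are placeholders, and the actual storage class is the one provided by the storage DPUService in your deployment.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-volume-example               # illustrative name
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block                        # request a raw block device instead of a filesystem
  storageClassName: example-storage-class  # placeholder; use the class exposed by the storage DPUService
  resources:
    requests:
      storage: 10Gi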
Hardware and Software Requirements
For Host Trusted mode: Refer to Host Trusted Prerequisites for detailed requirements.
For Zero Trust mode: Refer to Zero Trust Prerequisites for detailed requirements. In addition, note the additional requirements for deploying via Redfish.
DPU Hardware: NVIDIA BlueField-3 DPUs
Minimal DOCA BFB Image: DOCA v2.5 or higher (must be pre-installed on DPUs to support DPF provisioning)
Supported DOCA BFB Image:
bf-bundle-3.1.0
Installation Changes
Starting with DPF v25.7, all Helm dependencies have been removed from the DPF chart. This means that all dependencies must be installed manually before installing the DPF chart itself.
For detailed installation instructions and configuration values, refer to the Helm Prerequisites documentation.
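As a rough illustration only (the repository URL, chart names, and release names below are placeholders; the authoritative list of dependencies and values is in the Helm Prerequisites documentation), installing a prerequisite chart before the DPF chart could look like:
helm repo add <prereq-repo-name> <prereq-repo-url>        # placeholder repository for a prerequisite chart
helm repo update
helm install <prereq-release> <prereq-repo-name>/<prereq-chart> --namespace dpf-operator-system --create-namespace
helm install dpf-operator <dpf-chart-reference> --namespace dpf-operator-system    # install the DPF chart only after all prerequisites are in place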
Upgrade Support
DPF v25.7 does not support upgrades from previous versions. Users must completely uninstall/purge all DPF components and install on a fresh system.
Future Upgrade Support: Starting with DPF v25.10, we plan to introduce proper upgrade support that will allow seamless upgrades between compatible versions.
-
On initial deployment, DPF CRs may be stuck in their initial state (Pending, Initializing, etc.) and not progress
If DPF CRs are created before the DPF components are running, they may remain stuck in their initial state. DPF CRs must be created only after the DPF components have been deployed.
Internal Ref #4241297
Workaround: Delete any CRs that were created before the System components have been deployed and recreate them.
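For example, assuming a stuck DPUSet named my-dpuset defined in my-dpuset.yaml (names are illustrative), the CR can be recreated once the system components are running:
kubectl delete dpuset my-dpuset -n dpf-operator-system    # remove the CR that was created too early
kubectl apply -f my-dpuset.yaml                           # recreate it now that the DPF system components are deployed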
-
Stale ports after DPU reboot
When a DPU is rebooted, the old DPU service ports are not deleted from the DPU's OVS and remain stale.
Internal Ref #4174183
Workaround: No workaround, known issue; it should not affect performance.
-
DPU Cluster control-plane connectivity is lost when physical port P0 is down on the worker node
A link-down event on the p0 port of the DPU results in loss of control-plane connectivity for DPU components.
Internal Ref #3751863
Workaround: Make sure the P0 link is up on the DPU; if it is down, either restart the DPU or refer to the DOCA troubleshooting guide at https://docs.nvidia.com/networking/display/bfswtroubleshooting. Note: This issue is relevant for Host Trusted deployments only.
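A quick way to check the link state (run on the DPU Arm OS; p0 is the physical port referenced above) is, for example:
ip link show p0    # the link should report "state UP"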
-
Cluster MTU value cannot be dynamically changed
It is possible to deploy a DPF cluster with a custom MTU value; however, once deployed, the MTU value cannot be modified because it is applied across multiple distributed components.
Internal Ref #3917006
Workaround: Uninstall DPF and re-install from scratch using the new MTU value.
-
nvidia-k8s-ipam and servicechainset-controller DPF system DPUServices are in “Pending” phase
As long as there are no provisioned DPUs in the system, the nvidia-k8s-ipam and servicechainset-controller will appear as not ready / pending when querying dpuservices. This has no impact on performance or functionality since DPF system components are only relevant when there are DPUs to provision services on.
Internal Ref #4241324
Workaround: No workaround, known issue
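The state described above can be observed, for example, by listing the DPUService CRs (resource name and namespace assumed here):
kubectl get dpuservices -n dpf-operator-system    # nvidia-k8s-ipam and servicechainset-controller may show as Pending until DPUs are provisioned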
-
System doesn't recover after DPU Reset
When the user triggers a reset of a DPU in any way other than using DPF APIs (e.g. recreation of a DPU CR), the system may not recover.
Internal Ref #4424305, #4424235, #4188044
Workaround: Power cycle the host. Note that this operation is dangerous and there might be file system corruption which will require triggering the reprovisioning of the DPU. Note: This issue is relevant for Host Trusted deployments where OVN-Kubernetes is used as a primary CNI.
-
DPUDeployment disruptive upgrade handling is not always graceful
When a DPUDeployment triggers a disruptive upgrade (default behavior), there is a case where the upgrade does not trigger the related nodeEffect. This can happen when only the DPUService-related fields have changed.
Internal Ref #4252072
Workaround: No workaround, known issue.
-
Leftover CRs if worker is removed from the cluster permanently
When a worker is added to a cluster, optionally has a DPU provisioned, and is later removed from the host cluster permanently, there may be leftover DPF-related CRs in both the host cluster and the DPU cluster.
Internal Ref #4403130, #4426516
Workaround: No workaround, known issue.
-
DPUSet and BFB removal leaves DPU CRs in OS Installing Phase indefinitely
When a DPUSet that owns DPU CRs in the OS Installing phase is deleted together with the BFB that the DPUSet references, the DPUs remain stuck in the OS Installing phase indefinitely and cannot be removed. This can also occur as a race condition when the DPFOperatorConfig CR is deleted while there are DPUs in the OS Installing phase.
Internal Ref #4426349
Workaround: No workaround, known issue.
-
DPUDeployment or DPUSet created in namespace different from the dpf-operator-system namespace do not trigger DPU provisioning
When creating a DPUDeployment or a DPUSet in a namespace other than dpf-operator-system, there are no DPU CRs created due to the DPUNode CRs residing in the dpf-operator-system namespace.
Internal Ref #4427091
Workaround: Create the DPUDeployment or DPUSet in the dpf-operator-system namespace.
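A minimal sketch of the relevant part of the manifest (the apiVersion is a placeholder and the spec is omitted; only the namespace placement matters here):
apiVersion: <dpf-api-group>/<version>    # placeholder; use the apiVersion from the installed DPF CRDs
kind: DPUDeployment
metadata:
  name: my-dpudeployment                 # illustrative name
  namespace: dpf-operator-system         # must be dpf-operator-system, where the DPUNode CRs reside
spec:
  # ... the rest of the DPUDeployment spec is unchanged ...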
-
DPUService stops reconciling when DPUCluster is unavailable for a long time
When the DPUCluster is unavailable for a long time (more than 5 minutes), changes to DPUServices (including those generated via DPUDeployment or DPFOperatorConfig) that happened during that time might not be reflected in the DPUCluster.
Internal Ref #4359857
Workaround: Recreate DPUServices that are stuck.
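For example, assuming a stuck DPUService named my-service defined in my-service.yaml (names are illustrative):
kubectl delete dpuservice my-service -n dpf-operator-system    # remove the stuck CR
kubectl apply -f my-service.yaml                               # recreate it so reconciliation restarts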
-
DPUFlavor nvconfig limitations
DPUFlavor does not support assigning nvconfig parameters to only specific devices. Currently, it assigns the same parameters across all devices (i.e., '*').
Internal Ref #4583584
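A rough sketch of the relevant part of a DPUFlavor spec illustrating this limitation (field layout approximated; consult the DPUFlavor CRD reference for the exact fields and parameter names):
nvconfig:
  - device: "*"            # only the wildcard selector takes effect; per-device targeting is not supported
    parameters:
      - EXAMPLE_PARAM=1    # placeholder; actual firmware configuration parameters go here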
-
DPUFlavor nvconfig.hostPowerCycleRequired is not used
The nvconfig.hostPowerCycleRequired parameter is defined in the DPUFlavor spec but is not used in the code.
Internal Ref #4583693
-
[OVN-Kubernetes DPUService] Nodes marked as NotReady
When installing OVN-Kubernetes as a CNI on a node running containerd version 1.7.0 or above, the Node never becomes Ready.
Internal Ref #4178221
Workaround: Option 1: Use containerd version below v1.7.0 when using OVN-Kubernetes as a primary CNI. Option 2: Manually restart containerd on the host.
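For Option 2, restarting containerd on the host (assuming containerd runs as a systemd service) is typically:
sudo systemctl restart containerd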
-
[OVN-Kubernetes DPUService] Lost traffic from workloads to control plane components or Kubernetes services after DPU reboot, port flapping, OVS restart or manual network configuration
Connectivity issues between workload pods and control plane components or Kubernetes services may occur after the following events: DPU reboot without host reboot, high-speed port flapping (link down/up), OVS restart, or a DPU network configuration change (for example, running the "netplan apply" command on the DPU). The issues are caused by network configuration that was applied by the OVN CNI on the DPUs and is not reapplied automatically.
Internal Ref #4424305, #4188044, #4424235
Workaround: Recreate the OVN-Kubernetes node pod on the host to reapply the configuration.
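For example (namespace and pod names are placeholders; use whatever identifies the OVN-Kubernetes node pod on the affected host in your deployment):
kubectl get pods -n <ovn-kubernetes-namespace> -o wide | grep <affected-worker-node>   # locate the node pod running on the affected host
kubectl delete pod -n <ovn-kubernetes-namespace> <ovn-node-pod-name>                   # its DaemonSet recreates it and the network configuration is reapplied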
-
[OVN-Kubernetes DPUService] Host network configuration changes may result in lost traffic from host workloads (on overlay)
When changing the host network configuration (for example with netplan apply), custom network configuration applied by the host CNI components may be lost and is not reapplied automatically.
Internal Ref #4188044
Workaround: Recreate the OVN-Kubernetes node pod on the host to reapply the configuration.
-
[OVN-Kubernetes DPUService] InternalTrafficPolicy and ExternalTrafficPolicy Service options are not handled correctly
When external traffic reaches a NodePort Kubernetes Service on the host cluster via the OOB Network of the host, the user won't see the relevant policy option working as expected. The same applies for traffic originating from a Pod on the overlay network hitting the same type of service.
Internal Ref #4320953
Workaround: No workaround, known issue.
-
[OVN-Kubernetes DPUService] No network connectivity for SR-IOV accelerated workload pods after DPU reboot
An SR-IOV accelerated workload pod loses its VF interface upon DPU reboot. The VF is available on the host but is not injected back into the pod.
Internal Ref #4236521
Workaround: Recreate the SR-IOV accelerated workload pods.
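For example (pod and namespace names are illustrative; the pod's controller, e.g. a Deployment, recreates it and the VF is re-injected):
kubectl delete pod <sriov-workload-pod> -n <workload-namespace>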
-
[HBN DPUService] Invalid HBN configuration is not reflected to user in case it is syntactically valid
If the HBN YAML configuration is syntactically valid but contains values that are illegal from an NVUE perspective, the HBN service starts with the last known valid configuration, and this is not reflected to the end user.
Internal Ref #4172029
Workaround: No workaround, known issue.
-
[Firefly DPUService] Deletion of the Firefly DPUService leaves stale flows in OVS
When the Firefly DPUService is deleted after successful deployment or the labels of the serviceDaemonSet are modified, flows are not cleaned up from OVS.
Internal Ref #4382535
Workaround: Although these flows are very unlikely to cause an issue, reprovisioning the DPU or power cycling the host will bring the OVS back to a good state. Note that power cycling the host is dangerous and there might be file system corruption, which will require triggering the reprovisioning of the DPU.
-
[VPC OVN DPUService] E/W Traffic Not Passing within Same VPC after DPU Reprovisioning
After DPU reprovisioning, east-west (E/W) traffic may fail to pass between nodes under the same VPC. This happens because the OVN controller failed to register its chassis in the ovn-sb DB. This may also happen when the VPC is completely removed and re-applied.
Internal Ref: #4553212
Workaround:
Verify the ServiceInterfaces are ready on all nodes.
Confirm that these ServiceInterfaces are associated with the same DPUVirtualNetwork and are part of the expected DPUVPC.
Locate the ovn-central pod running in the management cluster, example:
kubectl get pods -n dpf-operator-system | grep ovn-central
Run the following command in the ovn-central pod to remove leftover chassis records:
kubectl exec -it -n dpf-operator-system <in-cluster-ovn-central> -- \
  sh -c 'for chassis in $(ovn-sbctl list Chassis | awk "/_uuid/ {print \$3}"); do ovn-sbctl destroy Chassis $chassis; done'
-
DPUServiceInterface doesn't wait for ServiceInterface to be deleted
When a DPUServiceInterface, DPUVPC, and DPUVirtualNetwork are deleted, the DPUServiceInterface reconciler does not wait for the underlying ServiceInterfaces to be deleted before proceeding. This causes the DPUVirtualNetwork and DPUVPC to be removed, and consequently the VPC ServiceInterface controller cannot properly unplug a ServiceInterface from the VPC, since the objects it waits for are already deleted.
Internal Ref #4578503