DOCA Platform Framework v25.1.1
This patch release of the DOCA Platform Framework (DPF) includes bug fixes and improvements to enhance the provisioning and orchestration of NVIDIA BlueField DPUs in Kubernetes environments.
Detailed information about the fixed issues can be found in the release notes for v25.1.
-
[HBN + OVN-Kubernetes DPUServices] HBN service restarts on DPU causes worker to lose traffic
If the HBN pod on the DPU resets, workloads on the host (any traffic on the OVN overlay) stop receiving traffic.
Internal Ref #4220185, #4223176
Workaround: Deploy the HBN DPUService with the values below to update the Helm chart version and image tag:
```yaml
spec:
  helmChart:
    source:
      repoURL: $NGC_HELM_REGISTRY_REPO_URL
      version: 1.0.2
      chart: doca-hbn
    values:
      image:
        repository: $HBN_NGC_IMAGE_URL
        tag: 2.4.2-doca2.9.2-32
```
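As a hypothetical usage example (the manifest file name, DPUService name, and namespace below are assumptions, not part of these release notes), the updated manifest can be applied and verified with kubectl:

```bash
# Apply the updated HBN DPUService manifest (file name and namespace are examples)
kubectl apply -n dpf-operator-system -f hbn-dpuservice.yaml

# Confirm the new chart version and image tag were picked up (DPUService name is an example)
kubectl get dpuservice doca-hbn -n dpf-operator-system -o yaml | grep -E 'version:|tag:'
```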
OVN-Kubernetes with OVS offload to the DPU
Hardware and Software Requirements
Refer to Prerequisites for detailed requirements.
DPU Hardware: NVIDIA BlueField-3 DPUs
Minimal DOCA BFB Image: DOCA v2.5 or higher (must be pre-installed on DPUs to support DPF provisioning)
Provisioned DOCA BFB Image:
bf-bundle-2.9.1-40
-
On initial deployment, DPF CRs may be stuck in their initial state (Pending, Initializing, etc.) and not progress
If DPF CRs are created before the DPF components are running, they may remain stuck in their initial state. DPF CRs must be created only after the DPF components have been deployed.
Internal Ref #4241297
Workaround: Delete any CRs that were created before the System components have been deployed and recreate them.
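A minimal sketch of this recovery flow, assuming the stuck CR was applied from a local manifest (file name and namespace are hypothetical):

```bash
# Verify the DPF system components are running before recreating CRs
kubectl get pods -n dpf-operator-system

# Delete the CR that was created too early and recreate it from its manifest
kubectl delete -f my-dpuset.yaml
kubectl apply -f my-dpuset.yaml
```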
-
DPUService stuck in its Deleting phase
A DPUService can get stuck in its Deleting phase when a pod created on the DPU as part of the DPUService cannot be deleted.
Internal Ref #4213229
Workaround: Reprovision the DPU where the DPUService is deployed.
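One way to trigger reprovisioning is to delete the corresponding DPU object so it is recreated, as described in the provisioning-retry issue below; the object name and namespace here are hypothetical:

```bash
# Locate the DPU object for the affected worker
kubectl get dpu -A

# Deleting the DPU object causes it to be recreated and the DPU to be reprovisioned
kubectl delete dpu worker-1-dpu -n dpf-operator-system
```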
-
Incompatible DPUFlavor can cause DPU to get into an unstable state
Using an incompatible DPUFlavor can put the DPU device into an error state that requires manual intervention, for example allocating 14GB of hugepages on a DPU with 16GB of memory.
Internal Ref #4200717
Workaround: Manually provision the DPU, or follow the DOCA troubleshooting documentation to return the DPU to an operational state: https://docs.nvidia.com/networking/display/bfswtroubleshooting.
-
Traffic loss after reconfiguration of DPUServices with a service chain between them
Reconfiguring DPUServices that have a service chain between them may cause traffic loss due to outdated service chains.
Internal Ref #4178445
Workaround: Recreate the SFC object between the services.
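A sketch of recreating the chain, assuming it was applied from a local manifest (the file name is hypothetical and the exact CR kind depends on your deployment):

```bash
# Recreate the service chain object between the affected services
kubectl delete -f service-chain.yaml
kubectl apply -f service-chain.yaml
```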
-
Stale ports after DPU reboot
When a DPU is rebooted, the old DPU service ports are not deleted from the DPU's OVS and remain stale.
Internal Ref #4174183
Workaround: No workaround, known issue; it should not affect performance.
-
BFB filename must be unique
If BFB CR#1 has the same bfb.spec.filename as BFB CR#2 but references a different URL (the actual BFB file to download), then BFB CR#1 will reference the wrong BFB.
Internal Ref #4143309
Workaround: Use a unique bfb.spec.filename when creating new BFB CRs.
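To check for duplicate filenames across existing BFB CRs before creating a new one, something like the following can be used (output columns are illustrative):

```bash
# List existing BFB CRs with their spec.filename values to spot duplicates
kubectl get bfb -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,FILENAME:.spec.filename
```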
-
DPU Cluster control-plane connectivity is lost when physical port P0 is down on the worker node
A link-down event on the DPU's p0 port results in loss of control-plane connectivity for the DPU components.
Internal Ref #3751863
Workaround: Make sure the P0 link is up on the DPU. If it is down, either restart the DPU or refer to the DOCA troubleshooting documentation: https://docs.nvidia.com/networking/display/bfswtroubleshooting.
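On the DPU's Arm OS, the link state can be checked and brought up with standard Linux tooling, for example:

```bash
# Check the current state of the p0 uplink on the DPU
ip link show p0

# Attempt to bring the link up; if it stays down, restart the DPU or consult the DOCA troubleshooting guide
ip link set p0 up
```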
-
DPU provisioning operations are not retried
DPU provisioning operations are not retried, so a transient environment glitch can leave the DPU object in an error phase even though the operation would have succeeded if retried.
Internal Ref #4202272
Workaround: Delete the DPU object that is in an error phase; it will be recreated and the provisioning operation will start from scratch.
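A short sketch of this flow (object name and namespace are hypothetical):

```bash
# Inspect the DPU object that is stuck in an error phase
kubectl describe dpu worker-2-dpu -n dpf-operator-system

# Delete it; the object is recreated and provisioning starts from scratch
kubectl delete dpu worker-2-dpu -n dpf-operator-system
```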
-
Cluster MTU value cannot be dynamically changed
It is possible to deploy a DPF cluster with a custom MTU value; however, once deployed, the MTU value cannot be modified because it is applied across multiple distributed components.
Internal Ref #3917006
Workaround: Uninstall DPF and re-install from scratch using the new MTU value.
-
nvidia-k8s-ipam and servicechainset-controller DPF system DPUServices are in “Pending” phase
As long as there are no provisioned DPUs in the system, the nvidia-k8s-ipam and servicechainset-controller will appear as not ready / pending when querying dpuservices.
This has no impact on performance or functionality since DPF system components are only relevant when there are DPUs to provision services on.
Internal Ref #4241324
Workaround: No workaround, known issue.
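The pending state can be observed, and safely ignored, when listing DPUServices (the namespace is an example):

```bash
# nvidia-k8s-ipam and servicechainset-controller show as pending until DPUs are provisioned
kubectl get dpuservices -n dpf-operator-system
```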
-
[OVN-Kubernetes DPUService] Nodes marked as NotReady
When installing OVN-Kubernetes as a CNI on a node running containerd version 1.7.0 or above, the Node never becomes Ready.
Internal Ref #4178221
Workaround:
Option 1: Use containerd version below v1.7.0 when using OVN-Kubernetes as a primary CNI.
Option 2: Manually restart containerd on the host.
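For option 2, a minimal example on the affected host:

```bash
# Restart containerd on the host so the node can become Ready
sudo systemctl restart containerd

# Confirm the node transitions to Ready
kubectl get nodes
```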
-
[OVN-Kubernetes DPUService] control plane node is not functional after reboot or network restart
During OVN-Kubernetes CNI installation on the control plane nodes, the management interface is moved, together with its IP, into a newly created OVS bridge. Since this network configuration is not persistent, it is lost on node reboot or network restart.
Internal Ref #4241306
Workaround: 1) Pre-define the OVS bridge on each control plane node with the OOB port's MAC and IP address, and ensure it gets a persistent IP:
```yaml
# Ubuntu example for netplan persistent network configuration
network:
  ethernets:
    oob:
      match:
        # the MAC address of the OOB port
        macaddress: xx:xx:xx:xx:xx:xx
      set-name: oob
  bridges:
    br0:
      addresses: [x.x.x.x/x]
      interfaces: [oob]
      # the MAC address of the OOB port
      macaddress: xx:xx:xx:xx:xx:xx
      openvswitch: {}
  version: 2
```
2) Set OVS bridge "bridge-uplink" in OVS metadata.
```bash
ovs-vsctl br-set-external-id br0 bridge-id br0 -- br-set-external-id br0 bridge-uplink oob
```
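The resulting bridge metadata can be verified with:

```bash
# Show the external-ids configured on br0
ovs-vsctl br-get-external-id br0
```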
-
[OVN-Kubernetes DPUService] Only a single OVN-Kubernetes DPU service version can be deployed across the cluster
The OVN-Kubernetes service does not fully support customization through Helm parameters, so only a single OVN-Kubernetes DPU service is supported across the entire cluster.
Internal Ref #4209524
Workaround: No workaround, known limitation.
-
[OVN-Kubernetes DPUService] Lost traffic from workloads to control plane components or K8S services after DPU reboot, port flapping, OVS restart, or manual network configuration
Connectivity issues between workload pods and control plane components or K8S services may occur after the following events: DPU reboot without host reboot, high-speed port flapping (link down/up), OVS restart, or a DPU network configuration change (for example using the "netplan apply" command on the DPU).
These issues occur because the network configuration applied by the OVN CNI components on the DPU is lost and is not reapplied automatically.
Internal Ref #4202272
Workaround: Recreate the OVN-Kubernetes node pod on the host to reapply the configuration.
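A sketch of this, assuming the OVN-Kubernetes node pods are labeled app=ovnkube-node in the ovn-kubernetes namespace (label, namespace, and node name are assumptions; adjust to your deployment). The same approach applies to the host network configuration issue below.

```bash
# Delete the OVN-Kubernetes node pod on the affected host; its DaemonSet recreates it
kubectl delete pod -n ovn-kubernetes -l app=ovnkube-node --field-selector spec.nodeName=worker-1
```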
-
[OVN-Kubernetes DPUService] Host network configuration changes may result in lost traffic from host workloads (on the overlay)
When changing the host network (for example with netplan apply), custom network configuration applied by the host CNI components may be lost and is not reapplied automatically.
Internal Ref #4188044
Workaround: Recreate the OVN-Kubernetes node pod on the host to reapply the configuration.
-
[OVN-Kubernetes DPUService] No network connectivity for SR-IOV accelerated workload pods after DPU reboot
An SR-IOV accelerated workload pod loses its VF interface upon DPU reboot. The VF is available on the host but is not injected back into the pod.
Internal Ref #4236521
Workaround: Recreate the SR-IOV accelerated workload pods.
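If the workload is managed by a Deployment, a rollout restart recreates the pods and re-injects the VF (Deployment name and namespace are hypothetical):

```bash
# Recreate the SR-IOV accelerated workload pods so the VF is injected again
kubectl rollout restart deployment/sriov-workload -n default
```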
-
[HBN DPUService] HBN DPUService cannot dynamically reload configurations
When the HBN configuration is updated through a ConfigMap, the running HBN container does not reload it and must be restarted to pick up the new configuration.
Internal Ref #4290426
Workaround: Recreate the HBN DPU service after changing the configuration.
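A minimal sketch, assuming the HBN ConfigMap and DPUService manifests are available locally (file names are hypothetical):

```bash
# Apply the updated HBN configuration
kubectl apply -f hbn-configmap.yaml

# Recreate the HBN DPUService so the new configuration is loaded
kubectl delete -f hbn-dpuservice.yaml
kubectl apply -f hbn-dpuservice.yaml
```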
-
[HBN DPUService] Invalid HBN configuration is not reported to the user when it is syntactically valid
If the HBN YAML configuration is syntactically valid but contains values that are illegal from an NVUE perspective, the HBN service starts with the last known valid configuration and the problem is not reported to the end user.
Workaround: No workaround, known issue.
-
[DTS DPUService] DTS appears as OutOfSync
When creating a DPUDeployment for the DTS DPU service, the DPUService object can be marked as OutOfSync even though the pods are running on the DPUs.
Internal Ref #4182929
Workaround: No workaround, known issue.
-
[OVN-Kubernetes DPUService] SR-IOV test Pod cannot reach Kubernetes API Service
When running an SR-IOV test Pod, the pod cannot reach the Kubernetes API Service. The issue is that the related conntrack entries sometimes miss the un-NAT step.
Internal Ref #4313629
Workaround: To unblock, run the following on the DPUs:
```bash
ovs-appctl revalidator/purge
ovs-appctl dpctl/flush-conntrack
```