Network#
The network should generally follow the common architecture: a standard North-South network and Spectrum-X for East-West. In our lab, all rails from both RTX PRO 6000 servers connect to the same switch, so we skipped most of the advanced network features.
Driver Installation#
Follow the DOCA host installation guide at https://docs.nvidia.com/doca/sdk/DOCA-Host-Installation-and-Upgrade/index.html.
Download and install the doca-all host drivers for your operating system using either of the
two methods described there (local or online).
Update the DPU OS and Firmware#
Start mst with sudo mst start, check the status of the DPU devices and their interface names
on the host with sudo mst status -v.
Make sure you have rshim driver running on the host: sudo systemctl status rshim.
Check all RShim devices are visible on the host:
sudo cat /dev/rshim*/misc | grep DEV_NAME
If one or more rshim devices are missing, restart the rshim driver:
sudo systemctl restart rshim
If the issue persists, uncomment FORCE_MODE 1 in /etc/rshim.conf and restart the rshim
driver once again. This should forcefully connect the rshim to the host.
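The visibility check above can be wrapped in a small helper that flags any rshim device directory whose misc file does not report a DEV_NAME. This is a minimal sketch: the function name check_rshim_devices and its optional base-directory argument are illustrative assumptions, not part of the rshim tooling.

```shell
# check_rshim_devices [BASE_DIR]
# Prints "<device>: <DEV_NAME>" for every rshimN directory under BASE_DIR
# (default /dev). Returns non-zero if no device is found or if any device
# has no DEV_NAME entry in its misc file.
check_rshim_devices() {
    local base="${1:-/dev}" dir name found=0 status=0
    for dir in "$base"/rshim*; do
        [ -d "$dir" ] || continue
        found=$((found + 1))
        # The misc file lists "KEY VALUE" pairs; pick the DEV_NAME value.
        name="$(awk '$1 == "DEV_NAME" {print $2; exit}' "$dir/misc" 2>/dev/null || true)"
        if [ -n "$name" ]; then
            echo "$(basename "$dir"): $name"
        else
            echo "$(basename "$dir"): DEV_NAME missing" >&2
            status=1
        fi
    done
    [ "$found" -gt 0 ] || status=1
    return "$status"
}
```

Reading /dev/rshim*/misc normally requires root, so the function is meant to be called from a root shell. A non-zero return is the cue to restart the rshim driver as described above.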
Download the latest / recommended BFB from https://developer.nvidia.com/doca-downloads?deployment_platform=BlueField&deployment_package=BF-FW-Bundle&installer_type=BFB.
Collect rshim indices with ls /dev/rshim*. E.g.:
/dev/rshim0:
boot console misc rshim
/dev/rshim1:
boot console misc rshim
/dev/rshim2:
boot console misc rshim
/dev/rshim3:
boot console misc rshim
/dev/rshim4:
boot console misc rshim
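Instead of reading the indices off the listing by eye, they can be derived from the directory names. A minimal sketch, assuming the devices follow the /dev/rshimN naming shown above; the helper name rshim_indices is hypothetical.

```shell
# rshim_indices [BASE_DIR]
# Prints the numeric suffix of every rshimN directory under BASE_DIR
# (default /dev), one per line, in ascending numeric order.
rshim_indices() {
    local base="${1:-/dev}" dir
    for dir in "$base"/rshim[0-9]*; do
        [ -d "$dir" ] || continue
        # Strip everything up to and including "rshim", leaving the index.
        echo "${dir##*/rshim}"
    done | sort -n
}
```

The output can replace a hard-coded index list in a loop, for example `for i in $(rshim_indices); do ...; done`.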
Then run the following loop to update the DPU devices with the BFB, using the indices collected above. Adjust the BFB file name accordingly:
for i in {0,1,2,3,4}; do \
    sudo bfb-install --bfb bf-fwbundle-3.1.0-76_25.07-prod.bfb --rshim rshim$i; \
done
Upon completion, cold boot the node(s).
Verify that the firmware and DPU OS versions match the expected ones. Using the device names from the
output of sudo mst status -v, put together a script similar to:
for i in {0,1,2,3,4}; do \
    sudo /usr/bin/flint -d /dev/mst/mt41692_pciconf$i q | grep -E FW; \
done
Example output:
FW Version: 32.46.1006
FW Release Date: 31.7.2025
FW Version: 32.46.1006
FW Release Date: 31.7.2025
FW Version: 32.46.1006
FW Release Date: 31.7.2025
FW Version: 32.46.1006
FW Release Date: 31.7.2025
FW Version: 32.46.1006
FW Release Date: 31.7.2025
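Comparing the versions by eye gets error-prone as the number of devices grows. The following sketch reads flint query output on standard input and compares every "FW Version:" line against an expected value. The function name check_fw_versions is an illustrative assumption, and the expected version must be supplied by the operator.

```shell
# check_fw_versions EXPECTED
# Reads `flint ... q` output on stdin. Reports each "FW Version:" line that
# differs from EXPECTED, prints a summary, and returns:
#   0 if all versions match, 1 on any mismatch, 2 if no version line was seen.
check_fw_versions() {
    local expected="$1"
    awk -v want="$expected" '
        $1 == "FW" && $2 == "Version:" {
            seen++
            if ($3 != want) {
                printf "mismatch: %s (expected %s)\n", $3, want
                bad++
            }
        }
        END {
            if (seen == 0) { print "no FW Version lines found"; exit 2 }
            printf "%d device(s) checked, %d mismatch(es)\n", seen, bad
            exit (bad > 0)
        }'
}
```

To use it, pipe the output of the flint loop above into `check_fw_versions 32.46.1006`, substituting the version expected for the BFB that was installed.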
Network Operator Installation#
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install network-operator nvidia/network-operator \
    -n nvidia-network-operator \
    --create-namespace \
    --values=values.yaml \
    --wait
Use the following values.yaml with the command above:
sriovNetworkOperator:
  enabled: true
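After the install returns, it can be useful to confirm that nothing in the operator namespace is stuck, and to repeat the check after each of the later steps that deploy additional pods. A minimal sketch using standard kubectl output options; the helper name pods_not_ready is hypothetical.

```shell
# pods_not_ready NAMESPACE
# Prints the names of pods in NAMESPACE whose phase is neither Running nor
# Succeeded. Empty output means every pod has at least been started.
pods_not_ready() {
    local ns="$1"
    kubectl get pods -n "$ns" --no-headers \
        -o custom-columns=NAME:.metadata.name,PHASE:.status.phase |
        awk '$2 != "Running" && $2 != "Succeeded" {print $1}'
}
```

For the installation above, the call would be `pods_not_ready nvidia-network-operator`. Note that this checks only the pod phase, not container readiness.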
Next, we use a simplified setup that does not involve accelerated OVS.
Define and Apply the NicClusterPolicy#
cat << EOF > nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nvIpam:
    image: nvidia-k8s-ipam
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    enableWebhook: false
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    multus:
      image: multus-cni
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
EOF
kubectl apply -f nic-cluster-policy.yaml
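The policy takes some time to roll out Multus, the CNI plugins and nv-ipam. The sketch below polls the policy until it reports a ready state. It assumes the operator publishes the value ready in .status.state of the NicClusterPolicy; the function names and the 5-second polling interval are illustrative choices.

```shell
# nic_policy_state [NAME]
# Prints the .status.state field of the named NicClusterPolicy.
nic_policy_state() {
    kubectl get nicclusterpolicy "${1:-nic-cluster-policy}" \
        -o jsonpath='{.status.state}'
}

# wait_for_nic_policy [NAME] [TIMEOUT_SECONDS]
# Polls every 5 seconds until the state is "ready" (assumed value) or the
# timeout expires. Returns 0 when ready, 1 on timeout.
wait_for_nic_policy() {
    local name="${1:-nic-cluster-policy}" timeout="${2:-300}" waited=0 state=""
    while [ "$waited" -le "$timeout" ]; do
        state="$(nic_policy_state "$name" 2>/dev/null || true)"
        [ "$state" = "ready" ] && return 0
        sleep 5
        waited=$((waited + 5))
    done
    echo "NicClusterPolicy $name not ready after ${timeout}s (last state: ${state:-unknown})" >&2
    return 1
}
```

Calling `wait_for_nic_policy` with no arguments waits up to five minutes for the policy defined above.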
Define IP Pools#
Here we go with a very simple “flat” addressing scheme that is consistent with the rest of the network setup:
cat << EOF > ip-pool.yaml
---
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: rail-1
  namespace: nvidia-network-operator
spec:
  subnet: 192.168.16.0/24
  perNodeBlockSize: 28
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nvidia.com/gpu.product
        operator: In
        values:
        - NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition
---
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: rail-2
  namespace: nvidia-network-operator
spec:
  subnet: 192.168.17.0/24
  perNodeBlockSize: 28
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nvidia.com/gpu.product
        operator: In
        values:
        - NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition
---
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: rail-3
  namespace: nvidia-network-operator
spec:
  subnet: 192.168.18.0/24
  perNodeBlockSize: 28
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nvidia.com/gpu.product
        operator: In
        values:
        - NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition
---
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: rail-4
  namespace: nvidia-network-operator
spec:
  subnet: 192.168.19.0/24
  perNodeBlockSize: 28
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nvidia.com/gpu.product
        operator: In
        values:
        - NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition
EOF
kubectl apply -f ip-pool.yaml
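Since perNodeBlockSize is the number of addresses reserved per node, the subnet size bounds how many nodes a pool can serve. The sketch below estimates that bound. It assumes the network and broadcast addresses are not handed out, so the exact figure in nv-ipam may differ slightly (for example, if a gateway address is also excluded); the function name is hypothetical and the estimate only makes sense for prefix lengths of 30 or shorter.

```shell
# max_nodes_per_pool PREFIX_LEN BLOCK_SIZE
# Estimates how many per-node blocks of BLOCK_SIZE addresses fit into an IPv4
# subnet with the given prefix length, assuming the network and broadcast
# addresses are excluded from allocation.
max_nodes_per_pool() {
    local prefix="$1" block="$2"
    local usable=$(( (1 << (32 - prefix)) - 2 ))
    echo $(( usable / block ))
}

# A /24 with 28 addresses per node: (256 - 2) / 28, rounded down.
max_nodes_per_pool 24 28   # prints 9
```

With the values used here, each rail can therefore serve roughly nine nodes, and the 28 addresses per node comfortably cover the 8 VFs per rail configured in the node policy.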
Define SriovNetworks#
Next, we define the SriovNetworks, which serve as the rails for multi-node applications:
cat << EOF > sriov-networks.yaml
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rail-1
  namespace: nvidia-network-operator
spec:
  ipam: '{"type": "nv-ipam", "poolName": "rail-1", "poolType": "ippool"}'
  networkNamespace: default
  resourceName: rail-1
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rail-2
  namespace: nvidia-network-operator
spec:
  ipam: '{"type": "nv-ipam", "poolName": "rail-2", "poolType": "ippool"}'
  networkNamespace: default
  resourceName: rail-2
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rail-3
  namespace: nvidia-network-operator
spec:
  ipam: '{"type": "nv-ipam", "poolName": "rail-3", "poolType": "ippool"}'
  networkNamespace: default
  resourceName: rail-3
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rail-4
  namespace: nvidia-network-operator
spec:
  ipam: '{"type": "nv-ipam", "poolName": "rail-4", "poolType": "ippool"}'
  networkNamespace: default
  resourceName: rail-4
EOF
kubectl apply -f sriov-networks.yaml
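The four SriovNetwork documents differ only in the rail number, so for a larger rail count they can be generated rather than written by hand. A minimal sketch; the function name render_sriov_network is an illustrative assumption, and the rendered fields mirror the manifest above.

```shell
# render_sriov_network RAIL_NUMBER
# Prints one SriovNetwork document for rail-<RAIL_NUMBER>, using the same
# namespace, IPAM type and resource naming as the hand-written manifest.
render_sriov_network() {
    local n="$1"
    cat << EOF
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rail-${n}
  namespace: nvidia-network-operator
spec:
  ipam: '{"type": "nv-ipam", "poolName": "rail-${n}", "poolType": "ippool"}'
  networkNamespace: default
  resourceName: rail-${n}
EOF
}
```

A loop such as `for n in 1 2 3 4; do render_sriov_network "$n"; done > sriov-networks.yaml` reproduces the file used above.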
Define the Network Node Policy#
cat << EOF > sriov-network-node-policy.yaml
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-1
  namespace: nvidia-network-operator
spec:
  eSwitchMode: legacy
  mtu: 9216
  nicSelector:
    pfNames: ["ens15f0np0"]
  numVfs: 8
  isRdma: true
  linkType: ETH
  resourceName: rail-1
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition"
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-2
  namespace: nvidia-network-operator
spec:
  eSwitchMode: legacy
  mtu: 9216
  nicSelector:
    pfNames: ["ens16f0np0"]
  numVfs: 8
  isRdma: true
  linkType: ETH
  resourceName: rail-2
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition"
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-3
  namespace: nvidia-network-operator
spec:
  eSwitchMode: legacy
  mtu: 9216
  nicSelector:
    pfNames: ["ens21f0np0"]
  numVfs: 8
  isRdma: true
  linkType: ETH
  resourceName: rail-3
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition"
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-4
  namespace: nvidia-network-operator
spec:
  eSwitchMode: legacy
  mtu: 9216
  nicSelector:
    pfNames: ["ens20f0np0"]
  numVfs: 8
  isRdma: true
  linkType: ETH
  resourceName: rail-4
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition"
EOF
kubectl apply -f sriov-network-node-policy.yaml
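The pfNames values above are specific to the lab hosts. On other hardware, one way to find candidate physical functions is to look for interfaces that expose a non-zero sriov_totalvfs in sysfs. A minimal sketch to run on the node itself; the function name list_sriov_pfs is hypothetical.

```shell
# list_sriov_pfs [SYSFS_NET_DIR]
# Prints "<interface> <sriov_totalvfs>" for every network interface under
# SYSFS_NET_DIR (default /sys/class/net) that supports at least one VF.
list_sriov_pfs() {
    local root="${1:-/sys/class/net}" f iface total
    for f in "$root"/*/device/sriov_totalvfs; do
        [ -r "$f" ] || continue
        total="$(cat "$f")"
        [ "$total" -gt 0 ] || continue
        # Reduce ".../<iface>/device/sriov_totalvfs" to "<iface>".
        iface="${f#"$root"/}"
        iface="${iface%%/*}"
        echo "$iface $total"
    done
}
```

The reported maximum also gives an upper bound for numVfs on each interface.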
Spawn a Test DaemonSet to Test the Connectivity#
cat << EOF > test-ds.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sample
spec:
  selector:
    matchLabels:
      name: sample
  template:
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: rail-1, rail-2, rail-3, rail-4
      labels:
        name: sample
    spec:
      nodeSelector:
        nvidia.com/gpu.product: "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition"
      containers:
      - name: sample
        command: ["bash"]
        args: ["-c", "sleep infinity & wait"]
        lifecycle:
          preStop:
            exec:
              command: ["pkill","sleep"]
        image: docker.io/deepops/nccl-tests:2312
        resources:
          limits:
            nvidia.com/rail-1: "1"
            nvidia.com/rail-2: "1"
            nvidia.com/rail-3: "1"
            nvidia.com/rail-4: "1"
            nvidia.com/gpu: "2"
          requests:
            nvidia.com/rail-1: "1"
            nvidia.com/rail-2: "1"
            nvidia.com/rail-3: "1"
            nvidia.com/rail-4: "1"
            nvidia.com/gpu: "2"
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - IPC_LOCK
EOF
kubectl apply -f test-ds.yaml
In each pod of the DaemonSet above you should see 4 additional interfaces, and from each pod you should be able to ping the corresponding interfaces of every other pod in the same DaemonSet across all 4 rails.
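To collect the addresses to ping, the secondary interfaces can be extracted from the output of ip inside a pod. The sketch below assumes Multus's default naming for additional interfaces (net1, net2, ...) and that the container image provides the ip and ping utilities; the function name rail_addresses is hypothetical.

```shell
# rail_addresses
# Reads `ip -o -4 addr show` output on stdin and prints "<interface> <IPv4>"
# for every interface named netN, dropping the prefix length.
rail_addresses() {
    awk '$3 == "inet" && $2 ~ /^net[0-9]+/ {
        sub(/@.*/, "", $2)    # drop any "@ifN" suffix from the interface name
        sub(/\/.*/, "", $4)   # drop the "/24" prefix length
        print $2, $4
    }'
}
```

For example, `kubectl exec <pod-a> -- ip -o -4 addr show | rail_addresses` lists the rail addresses of one pod, and `kubectl exec <pod-b> -- ping -c 3 -I net1 <net1 address of pod-a>` then tests the first rail from another pod.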