Network#

The network should generally follow the common network architecture for North-South traffic and Spectrum-X for East-West traffic. In our lab, all the rails from both RTX 6K servers face the same switch, so we have skipped most of the advanced network features.

Driver Installation#

Follow the DOCA host installation guide at https://docs.nvidia.com/doca/sdk/DOCA-Host-Installation-and-Upgrade/index.html. Download and install the doca-all host drivers for your operating system using either of the two methods described there (local or online).

Update the DPU OS and Firmware#

Start mst with sudo mst start, then check the status of the DPU devices and their interface names on the host with sudo mst status -v.

Make sure the rshim driver is running on the host: sudo systemctl status rshim.

Check that all RShim devices are visible on the host:

sudo cat /dev/rshim*/misc | grep DEV_NAME

If one or more rshim devices are missing, try restarting the rshim driver:

sudo systemctl restart rshim

If the issue persists, uncomment the FORCE_MODE 1 line in /etc/rshim.conf and restart the rshim driver again. This should force the rshim to attach to the host.

Download the latest / recommended BFB from https://developer.nvidia.com/doca-downloads?deployment_platform=BlueField&deployment_package=BF-FW-Bundle&installer_type=BFB.

Collect the rshim indices with ls /dev/rshim*. For example:

/dev/rshim0:
boot  console  misc  rshim
/dev/rshim1:
boot  console  misc  rshim
/dev/rshim2:
boot  console  misc  rshim
/dev/rshim3:
boot  console  misc  rshim
/dev/rshim4:
boot  console  misc  rshim
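The indices can also be derived rather than read off the listing by eye. The following is a minimal sketch, assuming the devices appear as directories named rshim<N> under /dev as in the listing above. The function name rshim_indices and its optional base-directory argument are illustrative and not part of any NVIDIA tool.

```shell
# List the numeric indices of rshim devices found under a base directory.
# The base directory defaults to /dev; the argument exists only so the
# helper can be tried against a scratch directory.
rshim_indices() {
  base="${1:-/dev}"
  for dev in "$base"/rshim*; do
    # Skip the unexpanded glob when no device is present.
    [ -e "$dev" ] || continue
    name="${dev##*/}"             # e.g. rshim3
    printf '%s\n' "${name#rshim}" # e.g. 3
  done | sort -n
}
```

Calling rshim_indices with no argument prints one index per line, so its output can drive a loop such as for i in $(rshim_indices); do ...; done instead of a hard-coded list.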

Then run a bash loop to update the DPU devices with the latest BFB, using the indices collected above. Adjust the name of the BFB file accordingly:

for i in {0,1,2,3,4}; do \
  sudo bfb-install --bfb bf-fwbundle-3.1.0-76_25.07-prod.bfb --rshim rshim$i; \
done
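The loop above keeps going even if one device fails to update. Where stopping at the first failure is preferable, a wrapper along these lines may help. This is a sketch only: install_bfb_all is a hypothetical helper name, and it relies on bfb-install returning a non-zero exit status on failure, which is an assumption worth confirming on your DOCA version.

```shell
# Flash one BFB onto several rshim devices, stopping at the first failure.
# Usage: install_bfb_all <bfb-file> <index>...
install_bfb_all() {
  bfb="$1"
  shift
  for idx in "$@"; do
    echo "Updating rshim$idx with $bfb"
    # Assumption: bfb-install exits non-zero when the update fails.
    if ! sudo bfb-install --bfb "$bfb" --rshim "rshim$idx"; then
      echo "bfb-install failed on rshim$idx" >&2
      return 1
    fi
  done
}
```

For the five devices listed earlier, the call would be install_bfb_all bf-fwbundle-3.1.0-76_25.07-prod.bfb 0 1 2 3 4.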

Upon completion, cold boot the node(s).

Verify that the firmware and DPU OS versions match the expected ones. Using the device names from the output of sudo mst status -v, put together a script similar to:

for i in {0,1,2,3,4}; do \
  sudo /usr/bin/flint -d /dev/mst/mt41692_pciconf$i q | grep FW; \
done

Example output:

FW Version:      32.46.1006
FW Release Date: 31.7.2025
FW Version:      32.46.1006
FW Release Date: 31.7.2025
FW Version:      32.46.1006
FW Release Date: 31.7.2025
FW Version:      32.46.1006
FW Release Date: 31.7.2025
FW Version:      32.46.1006
FW Release Date: 31.7.2025
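Comparing the versions by eye gets error-prone as the number of DPUs grows. The sketch below reads flint query output on standard input and fails if any device reports a different firmware version. The helper name check_fw_versions is hypothetical, and the parsing assumes the "FW Version:" line format shown in the example output above.

```shell
# Read flint query output on stdin and fail if any "FW Version:" line
# differs from the expected version, or if no such line is present.
check_fw_versions() {
  expected="$1"
  awk -v want="$expected" '
    /^[[:space:]]*FW Version:/ {
      seen++
      # Fields are: "FW", "Version:", "<version>".
      if ($3 != want) { printf "mismatch: %s\n", $3; bad++ }
    }
    END { if (seen == 0 || bad > 0) exit 1 }
  '
}
```

One way to use it is to pipe the whole verification loop into the function, for example for i in 0 1 2 3 4; do sudo flint -d /dev/mst/mt41692_pciconf$i q; done | check_fw_versions 32.46.1006.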

Network Operator Installation#

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install network-operator nvidia/network-operator \
  -n nvidia-network-operator \
  --create-namespace \
  --values=values.yaml \
  --wait

Use the following values.yaml with the command above:

sriovNetworkOperator:
  enabled: true

Next, we use a simplified setup that does not involve accelerated OVS.

Define and Apply the NicClusterPolicy#

cat << EOF > nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nvIpam:
    image: nvidia-k8s-ipam
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    enableWebhook: false
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    multus:
      image: multus-cni
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
EOF
kubectl apply -f nic-cluster-policy.yaml
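Before moving on, it can be useful to wait until the operator has finished rolling out the components requested by the policy. The following is a sketch, assuming the NicClusterPolicy exposes its overall state in status.state and reports the value ready when done. The function name and its default attempt count and delay are illustrative.

```shell
# Poll the NicClusterPolicy until its status.state reads "ready".
# Arguments: policy name, number of attempts, seconds between attempts.
wait_for_nic_cluster_policy() {
  name="${1:-nic-cluster-policy}"
  attempts="${2:-30}"
  delay="${3:-10}"
  state=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Assumption: the overall state lives in .status.state.
    state="$(kubectl get nicclusterpolicy "$name" \
      -o jsonpath='{.status.state}' 2>/dev/null || true)"
    if [ "$state" = "ready" ]; then
      echo "NicClusterPolicy $name is ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "NicClusterPolicy $name not ready (last state: ${state:-unknown})" >&2
  return 1
}
```

Running wait_for_nic_cluster_policy with no arguments polls nic-cluster-policy for up to about five minutes with the defaults above.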

Define IP Pools#

Here we use a very simple “flat” addressing scheme, with one /24 subnet per rail, that is consistent with the rest of the network setup:

cat << EOF > ip-pool.yaml
---
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: rail-1
  namespace: nvidia-network-operator
spec:
  subnet: 192.168.16.0/24
  perNodeBlockSize: 28
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nvidia.com/gpu.product
        operator: In
        values:
        - NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition
---
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: rail-2
  namespace: nvidia-network-operator
spec:
  subnet: 192.168.17.0/24
  perNodeBlockSize: 28
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nvidia.com/gpu.product
        operator: In
        values:
        - NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition
---
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: rail-3
  namespace: nvidia-network-operator
spec:
  subnet: 192.168.18.0/24
  perNodeBlockSize: 28
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nvidia.com/gpu.product
        operator: In
        values:
        - NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition
---
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: rail-4
  namespace: nvidia-network-operator
spec:
  subnet: 192.168.19.0/24
  perNodeBlockSize: 28
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nvidia.com/gpu.product
        operator: In
        values:
        - NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition
EOF
kubectl apply -f ip-pool.yaml
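The four pools differ only in their name and the third octet of the subnet, so the same file can be generated with a loop. The sketch below reproduces the manifest above. The function name render_ip_pools is hypothetical, and the mapping of rail N to 192.168.(15+N).0/24 is simply read off the values used in this lab.

```shell
# Print one IPPool document per rail. Rail N gets 192.168.(15+N).0/24,
# which reproduces the 192.168.16.0/24 .. 192.168.19.0/24 layout above.
render_ip_pools() {
  rails="${1:-4}"
  n=1
  while [ "$n" -le "$rails" ]; do
    cat << EOF
---
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: rail-$n
  namespace: nvidia-network-operator
spec:
  subnet: 192.168.$((15 + n)).0/24
  perNodeBlockSize: 28
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nvidia.com/gpu.product
        operator: In
        values:
        - NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition
EOF
    n=$((n + 1))
  done
}
```

With this in place, render_ip_pools 4 > ip-pool.yaml followed by kubectl apply -f ip-pool.yaml should be equivalent to the hand-written file.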

Define SriovNetworks#

Next, we define the SriovNetworks, which serve as the rails for multi-node applications:

cat << EOF > sriov-networks.yaml
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rail-1
  namespace: nvidia-network-operator
spec:
  ipam: '{"type": "nv-ipam", "poolName": "rail-1", "poolType": "ippool"}'
  networkNamespace: default
  resourceName: rail-1
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rail-2
  namespace: nvidia-network-operator
spec:
  ipam: '{"type": "nv-ipam", "poolName": "rail-2", "poolType": "ippool"}'
  networkNamespace: default
  resourceName: rail-2
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rail-3
  namespace: nvidia-network-operator
spec:
  ipam: '{"type": "nv-ipam", "poolName": "rail-3", "poolType": "ippool"}'
  networkNamespace: default
  resourceName: rail-3
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rail-4
  namespace: nvidia-network-operator
spec:
  ipam: '{"type": "nv-ipam", "poolName": "rail-4", "poolType": "ippool"}'
  networkNamespace: default
  resourceName: rail-4
EOF
kubectl apply -f sriov-networks.yaml
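Each SriovNetwork should result in a NetworkAttachmentDefinition of the same name in the namespace given by networkNamespace (default here). A quick check along the following lines can confirm that all four exist. This is a sketch, and missing_rail_nads is a hypothetical helper name.

```shell
# Report which rail NetworkAttachmentDefinitions are missing from a namespace.
# Prints the missing names and returns non-zero if any are absent.
missing_rail_nads() {
  ns="${1:-default}"
  missing=0
  for nad in rail-1 rail-2 rail-3 rail-4; do
    if ! kubectl get network-attachment-definitions "$nad" -n "$ns" \
        >/dev/null 2>&1; then
      echo "$nad"
      missing=1
    fi
  done
  return "$missing"
}
```

An empty result from missing_rail_nads default means all four rails are available to pods in the default namespace.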

Define the Network Node Policy#

cat << EOF > sriov-network-node-policy.yaml
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-1
  namespace: nvidia-network-operator
spec:
  eSwitchMode: legacy
  mtu: 9216
  nicSelector:
    pfNames: ["ens15f0np0"]
  numVfs: 8
  isRdma: true
  linkType: ETH
  resourceName: rail-1
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition"
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-2
  namespace: nvidia-network-operator
spec:
  eSwitchMode: legacy
  mtu: 9216
  nicSelector:
    pfNames: ["ens16f0np0"]
  numVfs: 8
  isRdma: true
  linkType: ETH
  resourceName: rail-2
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition"
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-3
  namespace: nvidia-network-operator
spec:
  eSwitchMode: legacy
  mtu: 9216
  nicSelector:
    pfNames: ["ens21f0np0"]
  numVfs: 8
  isRdma: true
  linkType: ETH
  resourceName: rail-3
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition"
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-4
  namespace: nvidia-network-operator
spec:
  eSwitchMode: legacy
  mtu: 9216
  nicSelector:
    pfNames: ["ens20f0np0"]
  numVfs: 8
  isRdma: true
  linkType: ETH
  resourceName: rail-4
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition"
EOF
kubectl apply -f sriov-network-node-policy.yaml
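Applying the node policies triggers VF creation on each matching node, which can take a while. The sketch below checks whether the SR-IOV operator has finished, assuming each SriovNetworkNodeState object reports its progress in status.syncStatus and uses the value Succeeded on completion. The function name is illustrative.

```shell
# Succeed only when every SriovNetworkNodeState reports syncStatus "Succeeded".
sriov_sync_succeeded() {
  # Assumption: one status word per node, separated by spaces.
  statuses="$(kubectl get sriovnetworknodestates -n nvidia-network-operator \
    -o jsonpath='{.items[*].status.syncStatus}' 2>/dev/null || true)"
  # No node states at all is treated as "not done".
  [ -n "$statuses" ] || return 1
  for s in $statuses; do
    [ "$s" = "Succeeded" ] || return 1
  done
  return 0
}
```

A simple way to wait is until sriov_sync_succeeded; do sleep 10; done, after which the nodes should advertise the nvidia.com/rail-1 through nvidia.com/rail-4 resources.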

Spawn a Test DaemonSet to Verify Connectivity#

cat << EOF > test-ds.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sample
spec:
  selector:
    matchLabels:
      name: sample
  template:
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: rail-1, rail-2, rail-3, rail-4
      labels:
        name: sample
    spec:
      nodeSelector:
        nvidia.com/gpu.product: "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition"
      containers:
      - name: sample
        command: ["bash"]
        args: ["-c", "sleep infinity & wait"]
        lifecycle:
          preStop:
            exec:
              command: ["pkill","sleep"]
        image: docker.io/deepops/nccl-tests:2312
        resources:
          limits:
            nvidia.com/rail-1: "1"
            nvidia.com/rail-2: "1"
            nvidia.com/rail-3: "1"
            nvidia.com/rail-4: "1"
            nvidia.com/gpu: "2"
          requests:
            nvidia.com/rail-1: "1"
            nvidia.com/rail-2: "1"
            nvidia.com/rail-3: "1"
            nvidia.com/rail-4: "1"
            nvidia.com/gpu: "2"
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - IPC_LOCK
EOF
kubectl apply -f test-ds.yaml

Each pod of the DaemonSet above should show four additional interfaces, one per rail, and should be able to ping the corresponding interfaces of every other pod in the same DaemonSet across all four rails.
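The per-rail ping check can be scripted. The sketch below is built on several assumptions: the secondary interfaces carry Multus' default names net1 through net4 in the order of the annotation, the container image provides the ip and ping commands, and the pods run in the current namespace. ping_rails is a hypothetical helper name.

```shell
# Ping from one pod to another over each secondary interface.
# Usage: ping_rails <source-pod> <destination-pod>
ping_rails() {
  src="$1"
  dst="$2"
  failed=0
  for i in 1 2 3 4; do
    # Extract the IPv4 address of net<i> in the destination pod
    # (fourth field of "ip -o" output, with the prefix length removed).
    addr="$(kubectl exec "$dst" -- ip -4 -o addr show dev "net$i" 2>/dev/null \
      | awk '{ split($4, a, "/"); print a[1]; exit }')"
    if [ -z "$addr" ]; then
      echo "rail-$i: no address on net$i in $dst"
      failed=1
      continue
    fi
    # Bind the ping to the matching interface in the source pod.
    if kubectl exec "$src" -- ping -c 3 -I "net$i" "$addr" >/dev/null 2>&1; then
      echo "rail-$i: ok ($addr)"
    else
      echo "rail-$i: FAILED ($addr)"
      failed=1
    fi
  done
  return "$failed"
}
```

The pod names can be taken from kubectl get pods -l name=sample -o wide, picking two pods scheduled on different nodes.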