RDG for DPF Zero Trust (DPF-ZT)

Created on Apr 20, 2025

Scope

This Reference Deployment Guide (RDG) provides comprehensive instructions for deploying the NVIDIA DOCA Platform Framework (DPF) on high-performance, bare-metal infrastructure in Zero-Trust mode. It focuses on the setup and use of DPU-based services on NVIDIA® BlueField®-3 DPUs to deliver secure, isolated, and hardware-accelerated environments.

The guide is intended for experienced system administrators, systems engineers, and solution architects who build highly secure bare-metal environments using NVIDIA BlueField DPUs for acceleration, isolation, and infrastructure offload.

Note
  • This reference implementation, as the name implies, is a specific, opinionated deployment example designed to address the use case described above.

  • Although other approaches may exist for implementing similar solutions, this document provides a detailed guide for this specific method.

Abbreviations and Acronyms

Term   Definition                                          Term   Definition
BFB    BlueField Bootstream                                NGC    NVIDIA GPU Cloud
DOCA   Data Center Infrastructure-on-a-Chip Architecture   NFS    Network File System
DPF    DOCA Platform Framework                             OOB    Out-of-Band
DPU    Data Processing Unit                                PF     Physical Function
K8S    Kubernetes                                          RDG    Reference Deployment Guide
KVM    Kernel-based Virtual Machine                        RDMA   Remote Direct Memory Access
MAAS   Metal as a Service                                  RoCE   RDMA over Converged Ethernet
MTU    Maximum Transmission Unit                           ZT     Zero Trust

Introduction

The NVIDIA BlueField-3 Data Processing Unit (DPU) is a 400 Gb/s infrastructure compute platform designed for line-rate processing of software-defined networking, storage, and cybersecurity workloads. It combines powerful compute resources, high-speed networking, and advanced programmability to deliver hardware-accelerated, software-defined solutions for modern data centers.

NVIDIA DOCA unleashes the full potential of the BlueField platform by enabling rapid development of applications and services that offload, accelerate, and isolate data center workloads.

However, deploying and managing DPUs, especially at scale, presents operational challenges. Without a robust provisioning and orchestration system, tasks such as lifecycle management, service deployment, and network configuration for service function chaining (SFC) can quickly become complex and error prone. This is where the DOCA Platform Framework (DPF) comes into play.

DPF automates the full DPU lifecycle and simplifies advanced network configurations. With DPF, services can be deployed seamlessly, allowing for efficient offloading and intelligent routing of traffic through the DPU data plane.

By leveraging DPF, users can scale and automate DPU management across bare-metal, virtual, and Kubernetes customer environments, optimizing performance while simplifying operations.

DPF supports multiple deployment models. This guide focuses on the Zero Trust bare-metal deployment model. In this scenario:

  • The DPU is managed through its Baseboard Management Controller (BMC)
  • All management traffic occurs over the DPU's out-of-band (OOB) network
  • The host is treated as an untrusted entity with respect to the data center network. The DPU acts as a barrier between the host and the network.
  • The host sees the DPU as a standard NIC, with no access to the internal DPU management plane (Zero Trust Mode)

This Reference Deployment Guide (RDG) provides a step-by-step example for installing DPF in Zero-Trust mode. It also includes practical demonstrations of performance optimization, validated using standard RDMA and TCP workloads.

As part of the reference implementation, open-source components outside the scope of DPF (e.g., MAAS, pfSense, Kubespray) are used to simulate a realistic customer deployment environment. The guide includes the full end-to-end deployment process, including:

  • Infrastructure provisioning
  • DPF deployment
  • DPU provisioning (Redfish)
  • Service configuration and deployment
  • Service chaining

    Solution Architecture

    Key Components and Technologies

    • NVIDIA BlueField® Data Processing Unit (DPU)

      The NVIDIA® BlueField® data processing unit (DPU) ignites unprecedented innovation for modern data centers and supercomputing clusters. With its robust compute power and integrated software-defined hardware accelerators for networking, storage, and security, BlueField creates a secure and accelerated infrastructure for any workload in any environment, ushering in a new era of accelerated computing and AI.

    • NVIDIA DOCA Software Framework

      NVIDIA DOCA™ unlocks the potential of the NVIDIA® BlueField® networking platform. By harnessing the power of BlueField DPUs and SuperNICs, DOCA enables the rapid creation of applications and services that offload, accelerate, and isolate data center workloads. It lets developers create software-defined, cloud-native, DPU- and SuperNIC-accelerated services with zero-trust protection, addressing the performance and security demands of modern data centers.

    • NVIDIA ConnectX SmartNICs

      10/25/40/50/100/200 and 400G Ethernet Network Adapters

      The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offers advanced hardware offloads and accelerations.

      NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.

    • NVIDIA LinkX Cables

      The NVIDIA® LinkX® product family of cables and transceivers provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400GbE Ethernet and 100, 200, and 400Gb/s InfiniBand products for cloud, HPC, hyperscale, enterprise, telco, storage, and artificial intelligence data center applications.

    • NVIDIA Spectrum Ethernet Switches

      Flexible form-factors with 16 to 128 physical ports, supporting 1GbE through 400GbE speeds.

      Based on a ground-breaking silicon technology optimized for performance and scalability, NVIDIA Spectrum switches are ideal for building high-performance, cost-effective, and efficient Cloud Data Center Networks, Ethernet Storage Fabric, and Deep Learning Interconnects.

      NVIDIA combines the benefits of NVIDIA Spectrum switches, based on an industry-leading application-specific integrated circuit (ASIC) technology, with a wide variety of modern network operating system choices, including NVIDIA Cumulus® Linux, SONiC, and NVIDIA Onyx®.

    • NVIDIA Cumulus Linux

      NVIDIA® Cumulus® Linux is the industry's most innovative open network operating system that allows you to automate, customize, and scale your data center network like no other.

    • NVIDIA Network Operator

      The NVIDIA Network Operator simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster. The operator automatically installs the required host networking software - bringing together all the needed components to provide high-speed network connectivity. These components include the NVIDIA networking driver, Kubernetes device plugin, CNI plugins, IP address management (IPAM) plugin and others. The NVIDIA Network Operator works in conjunction with the NVIDIA GPU Operator to deliver high-throughput, low-latency networking for scale-out, GPU computing clusters.

    • Kubernetes

      Kubernetes is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.

    • Kubespray

      Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks and provides:

      • A highly available cluster
      • Composable attributes
      • Support for most popular Linux distributions

    Solution Design

    Solution Logical Design

    The logical design includes the following components:

    • 1 x Hypervisor node (KVM-based) with ConnectX-7:

      • 1 x Firewall VM
      • 1 x Jump Node VM
      • 1 x MaaS VM
      • 3 x K8s Master VMs running all K8s management components
    • 2 x Worker nodes (PCI Gen5), each with 1 x BlueField-3 NIC
    • Single High-Speed (HS) switch
    • 1 Gb Host Management network

    [Figure: solution logical design]

    Firewall Design

    The pfSense firewall in this solution serves a dual purpose:

    • Firewall—provides an isolated environment for the DPF system, ensuring secure operations
    • Router—enables Internet access for the management network

    Port-forwarding rules for SSH and RDP are configured on the firewall to route traffic to the jump node’s IP address in the host management network. From the jump node, administrators can manage and access various devices in the setup, as well as handle the deployment of the Kubernetes (K8s) cluster and DPF components.
    The following diagram illustrates the firewall design used in this solution:

    [Figure: firewall design]

    Software Stack Components

    [Figure: software stack components and versions]

    Warning

    Make sure to use the exact same versions for the software stack as described above.

    Bill of Materials

    [Figure: bill of materials]

    Deployment and Configuration

    Node and Switch Definitions

    These are the definitions and parameters used for deploying the demonstrated fabric:

    Switches Ports Usage

    Hostname      Rack ID   Ports
    mgmt-switch   1         swp1-3
    hs-switch     1         swp1-5

    Hosts

    Rack    Server Type            Server Name   Switch Port                  IP and NICs                               Default Gateway
    Rack1   Hypervisor Node        hypervisor    mgmt-switch: swp1            lab-br (interface eno1): Trusted LAN IP   Trusted LAN GW
                                                 hs-switch: swp1              mgmt-br (interface eno2): -
                                                                              hs-br (interface enp1s0): -
    Rack1   Firewall (Virtual)     fw            -                            WAN (lab-br): Trusted LAN IP              Trusted LAN GW
                                                                              LAN (mgmt-br): 10.0.110.254/24
                                                                              OPT1 (hs-br): 172.169.50.1/30
    Rack1   Jump Node (Virtual)    jump          -                            enp1s0: 10.0.110.253/24                   10.0.110.254
    Rack1   MaaS (Virtual)         maas          -                            enp1s0: 10.0.110.252/24                   10.0.110.254
    Rack1   Master Node (Virtual)  master1       -                            enp1s0: 10.0.110.1/24                     10.0.110.254
    Rack1   Master Node (Virtual)  master2       -                            enp1s0: 10.0.110.2/24                     10.0.110.254
    Rack1   Master Node (Virtual)  master3       -                            enp1s0: 10.0.110.3/24                     10.0.110.254
    Rack1   Worker Node            worker1       mgmt-switch: swp2 (DPU OOB)  dpubmc: 10.0.110.21/24                    10.0.110.254
                                                 hs-switch: swp2-swp3         ens1f0np0/ens1f1np1: 10.0.120.0/22
    Rack1   Worker Node            worker2       mgmt-switch: swp3 (DPU OOB)  dpubmc: 10.0.110.22/24                    10.0.110.254
                                                 hs-switch: swp4-swp5         ens1f0np0/ens1f1np1: 10.0.120.0/22

    Wiring

    Hypervisor Node

    [Figure: hypervisor node wiring]

    K8s Worker Node

    [Figure: K8s worker node wiring]

    Fabric Configuration

    Updating Cumulus Linux

    As a best practice, make sure to use the latest released Cumulus Linux NOS version.

    For information on how to upgrade Cumulus Linux, refer to the Cumulus Linux User Guide.

    Configuring the Cumulus Linux Switch

    The SN3700 switch (hs-switch) is configured as follows:

    SN3700 Switch Console

    nv set evpn enable on
    nv set interface eth0 ip address dhcp
    nv set interface eth0 type eth
    nv set interface lo ip address 11.0.0.101/32
    nv set interface lo type loopback
    nv set interface swp1-5 link state up
    nv set interface swp1-5 type swp
    nv set interface swp1 ip address 172.169.50.2/30
    nv set service ntp mgmt server 0.cumulusnetworks.pool.ntp.org
    nv set service ntp mgmt server 1.cumulusnetworks.pool.ntp.org
    nv set service ntp mgmt server 2.cumulusnetworks.pool.ntp.org
    nv set service ntp mgmt server 3.cumulusnetworks.pool.ntp.org
    nv set system aaa class nvapply action allow
    nv set system aaa class nvapply command-path / permission all
    nv set system aaa class nvshow action allow
    nv set system aaa class nvshow command-path / permission ro
    nv set system aaa class sudo action allow
    nv set system aaa class sudo command-path / permission all
    nv set system aaa role nvue-admin class nvapply
    nv set system aaa role nvue-monitor class nvshow
    nv set system aaa role system-admin class nvapply
    nv set system aaa role system-admin class sudo
    nv set system aaa user cumulus full-name cumulus,,,
    nv set system aaa user cumulus hashed-password '*'
    nv set system aaa user cumulus role system-admin
    nv set system api state enabled
    nv set system config auto-save state enabled
    nv set system reboot mode cold
    nv set system ssh-server state enabled
    nv config apply -y
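
The routed link between the firewall OPT1 interface (172.169.50.1/30) and swp1 on the high-speed switch (172.169.50.2/30) only works if both addresses are distinct usable hosts of the same /30. The following is a minimal sketch of that sanity check using Python's standard ipaddress module. The function and constant names are illustrative assumptions and are not part of the guide or of any NVIDIA tooling.

```python
import ipaddress


def same_p2p_subnet(a: str, b: str) -> bool:
    """Return True if both interface addresses are distinct usable hosts of one subnet."""
    ia = ipaddress.ip_interface(a)
    ib = ipaddress.ip_interface(b)
    if ia.network != ib.network or ia.ip == ib.ip:
        return False
    usable = set(ia.network.hosts())  # excludes network and broadcast addresses
    return ia.ip in usable and ib.ip in usable


# Addresses taken from the Hosts table and the switch configuration in this guide.
FIREWALL_OPT1 = "172.169.50.1/30"
SWITCH_SWP1 = "172.169.50.2/30"

print(same_p2p_subnet(FIREWALL_OPT1, SWITCH_SWP1))
```

If you renumber the link for your own environment, rerun the check with the new pair before applying the switch and firewall configuration.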

    The SN2201 switch (mgmt-switch) is configured as follows:

    SN2201 Switch Console

    nv set interface swp1-3 link state up
    nv set interface swp1-3 type swp
    nv set interface swp1-3 bridge domain br_default
    nv set bridge domain br_default untagged 1
    nv config apply
    nv config save -y

    Host Configuration

    Warning

    All worker nodes must have the same PCIe placement for the BlueField-3 NIC and must display the same interface name.

    Hypervisor Installation and Configuration

    The hypervisor used in this Reference Deployment Guide (RDG) is based on Ubuntu 24.04 with KVM.

    While this document does not detail the KVM installation process, it is important to note that the setup requires the following ISOs to deploy the Firewall, Jump, and MaaS virtual machines (VMs):

    • Ubuntu 24.04
    • pfSense-CE-2.7.2

    To implement the solution, three Linux bridges must be created on the hypervisor:

    Note

    Ensure a DHCP record is configured for the lab-br bridge interface in your trusted LAN to assign it an IP address.

    • lab-br – Connects the Firewall VM to the trusted LAN.
    • mgmt-br – Connects the various VMs to the host management network.
    • hs-br – Connects the Firewall VM to the high-speed network.

    Additionally, an MTU of 9000 must be configured on the management and high-speed bridges (mgmt-br and hs-br) as well as on their uplink interfaces to ensure optimal performance.

    Hypervisor netplan configuration

    network:
      ethernets:
        eno1:
          dhcp4: false
        eno2:
          dhcp4: false
          mtu: 9000
        ens2f0np0:
          dhcp4: false
          mtu: 9000
      bridges:
        lab-br:
          interfaces: [eno1]
          dhcp4: true
        mgmt-br:
          interfaces: [eno2]
          dhcp4: false
          mtu: 9000
        hs-br:
          interfaces: [ens2f0np0]
          dhcp4: false
          mtu: 9000
      version: 2

    Apply the configuration:

    Hypervisor Console

    $ sudo netplan apply
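
After the configuration is applied, it can be useful to confirm that the bridges and their uplinks actually report an MTU of 9000. The sketch below reads the value from Linux sysfs (/sys/class/net/<interface>/mtu). The helper names and the interface list are assumptions taken from the netplan example in this guide, so adjust them to your hardware.

```python
from pathlib import Path

EXPECTED_MTU = 9000
# Interface names assumed from the hypervisor netplan example in this guide.
JUMBO_INTERFACES = ["eno2", "ens2f0np0", "mgmt-br", "hs-br"]


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Return the MTU reported by sysfs, or None if the interface does not exist."""
    try:
        return int((Path(sysfs_root) / iface / "mtu").read_text().strip())
    except FileNotFoundError:
        return None


def mtu_mismatches(ifaces, expected=EXPECTED_MTU, sysfs_root="/sys/class/net"):
    """Return {iface: actual_mtu} for every interface whose MTU differs from expected."""
    result = {}
    for iface in ifaces:
        actual = read_mtu(iface, sysfs_root)
        if actual != expected:
            result[iface] = actual
    return result
```

Calling mtu_mismatches(JUMBO_INTERFACES) on the hypervisor returns an empty dictionary when every listed interface uses the expected MTU. A value of None marks an interface that was not found.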

    Prepare Infrastructure Servers

    Firewall VM - pfSense Installation and Interface Configuration

    Download the pfSense CE (Community Edition) ISO to your hypervisor and proceed with the software installation.

    Suggested spec:

    • vCPU: 2
    • RAM: 2GB
    • Storage: 10GB
    • Network interfaces

      • Bridge device connected to lab-br
      • Bridge device connected to mgmt-br
      • Bridge device connected to hs-br

    The Firewall VM must be connected to all three Linux bridges on the hypervisor. Before beginning the installation, ensure that three virtual network interfaces of type "Bridge device" are configured. Each interface should be connected to a different bridge (lab-br, mgmt-br, and hs-br) as illustrated in the diagram below.

    [Figure: Firewall VM network interface configuration in KVM]

    After completing the installation, the setup wizard displays a menu with several options, such as "Assign Interfaces" and "Reboot System." During this phase, you must configure the network interfaces for the Firewall VM.

    1. Select Option 2: "Set interface(s) IP address" and configure the interfaces as follows:

      • WAN (lab-br) – Trusted LAN IP (Static/DHCP)
      • LAN (mgmt-br) – Static IP 10.0.110.254/24
      • OPT1 (hs-br) – Static IP 172.169.50.1/30
    2. Once the interface configuration is complete, use a web browser within the host management network to access the Firewall web interface and finalize the configuration.

    Next, proceed with installing the Jump VM. This VM serves as a platform for running a browser for accessing the firewall’s web interface (UI) for post-installation configuration.

    Jump VM

    Suggested specifications:

    • vCPU: 4
    • RAM: 8GB
    • Storage: 25GB
    • Network interface: Bridge device, connected to mgmt-br

    Procedure:

    1. Install standard Ubuntu 24.04 on each host. Use the following login credentials across all nodes in this deployment:

      Username   Password
      depuser    user

    1. Enable internet connectivity and DNS resolution by creating the following Netplan configuration:

      Note

      Use 10.0.110.254 as a temporary DNS nameserver until the MaaS VM is installed and configured. After completing the MaaS installation, update the Netplan file to replace this address with the MaaS IP: 10.0.110.252.

      Jump Node netplan

      network:
        ethernets:
          enp1s0:
            dhcp4: false
            addresses: [10.0.110.253/24]
            nameservers:
              search: [dpf.rdg.local.domain]
              addresses: [10.0.110.254]
            routes:
              - to: default
                via: 10.0.110.254
        version: 2

    2. Apply the configuration:

      Jump Node Console

      depuser@jump:~$ sudo netplan apply

    3. Update and upgrade the system:

      Jump Node Console

      depuser@jump:~$ sudo apt update -y
      depuser@jump:~$ sudo apt upgrade -y

    4. Install and configure the Xfce desktop environment and XRDP (complementary packages for RDP):

      Jump Node Console

      depuser@jump:~$ sudo apt install -y xfce4 xfce4-goodies
      depuser@jump:~$ sudo apt install -y lightdm-gtk-greeter
      depuser@jump:~$ sudo apt install -y xrdp
      depuser@jump:~$ echo "xfce4-session" | tee .xsession
      depuser@jump:~$ sudo systemctl restart xrdp

    5. Install Firefox for accessing the Firewall web interface:

      Jump Node Console

      $ sudo apt install -y firefox

    6. Install and configure an NFS server with the /mnt/dpf_share directory:

      Jump Node Console

      $ sudo apt install -y nfs-server
      $ sudo mkdir -m 777 /mnt/dpf_share
      $ sudo vi /etc/exports

    7. Add the following line to /etc/exports:

      Jump Node Console

      /mnt/dpf_share 10.0.110.0/24(rw,sync,no_subtree_check)

    8. Restart the NFS server:

      Jump Node Console

      $ sudo systemctl restart nfs-server

    9. Create the directory bfb under /mnt/dpf_share with the same permissions as the parent directory:

      Jump Node Console

      $ sudo mkdir -m 777 /mnt/dpf_share/bfb

    10. Generate an SSH key pair for depuser on the jump node (these keys are later imported into the MaaS admin user profile to enable password-less login to the provisioned servers):

      Jump Node Console

      depuser@jump:~$ ssh-keygen -t rsa

    Firewall VM – Web Configuration

    From your Jump node, open a Firefox web browser and navigate to the pfSense web UI (http://10.0.110.254). The default login credentials are admin/pfsense. The login page should appear as follows:

    Note

    The IP addresses from the trusted LAN network under "DNS servers" and "Interfaces - WAN" are blurred.

    [Screenshot: pfSense web UI]

    Configure the following settings:

    Note

    The following screenshots display only part of the configuration view. Make sure not to miss any of the steps listed below.

    • Interfaces

      • WAN – Mark “Enable interface”, unmark “Block private networks and loopback addresses”, “MTU”: 9000

        [Screenshot: WAN interface settings]

      • LAN – Mark “Enable interface”, “IPv4 configuration type”: Static IPv4 (“IPv4 Address”: 10.0.110.254/24, “IPv4 Upstream Gateway”: None), “MTU”: 9000

        [Screenshot: LAN interface settings]

      • OPT1 – Mark “Enable interface”, “IPv4 configuration type”: Static IPv4 (“IPv4 Address”: 172.169.50.1/30, “IPv4 Upstream Gateway”: None), “MTU”: 9000

        [Screenshot: OPT1 interface settings]

    • Firewall:

      • NAT -> Port Forward -> Add rule -> “Interface”: WAN, “Address Family”: IPv4, “Protocol”: TCP, “Destination”: WAN address, “Destination port range”: (“From port”: SSH, “To port”: SSH), “Redirect target IP”: (“Type”: Address or Alias, “Address”: 10.0.110.253), “Redirect target port”: SSH, “Description”: NAT SSH

        [Screenshot: NAT SSH port-forward rule]

      • NAT -> Port Forward -> Add rule -> “Interface”: WAN, “Address Family”: IPv4, “Protocol”: TCP, “Destination”: WAN address, “Destination port range”: (“From port”: MS RDP, “To port”: MS RDP), “Redirect target IP”: (“Type”: Address or Alias, “Address”: 10.0.110.253), “

        [Screenshots: NAT port-forward rules]

      • Rules -> OPT1 -> Add rule -> “Action”: Pass, “Interface”: OPT1, “Address Family”: IPv4+IPv6, “Protocol”: Any, “Source”: Any, “Destination”: Any

        [Screenshot: OPT1 firewall rule]

    MaaS VM

    Suggested specifications:

    • vCPU: 4
    • RAM: 4 GB
    • Storage: 100 GB
    • Network interface: Bridge device, connected to mgmt-br

    Procedure:

    1. Perform a regular Ubuntu installation on the MaaS VM.
    2. Create the following Netplan configuration to enable internet connectivity and DNS resolution:

      Note

      Use 10.0.110.254 as a temporary DNS nameserver. After the MaaS installation, replace this with the MaaS IP address (10.0.110.252) in both the Jump and MaaS VM Netplan files.

      MaaS netplan

      network:
        ethernets:
          enp1s0:
            dhcp4: false
            addresses: [10.0.110.252/24]
            nameservers:
              search: [dpf.rdg.local.domain]
              addresses: [10.0.110.254]
            routes:
              - to: default
                via: 10.0.110.254
        version: 2

    3. Apply the netplan configuration:

      MaaS Console

      depuser@maas:~$ sudo netplan apply

    4. Update and upgrade the system:

      MaaS Console

      depuser@maas:~$ sudo apt update -y
      depuser@maas:~$ sudo apt upgrade -y

    5. Install PostgreSQL and configure the database for MaaS:

      MaaS Console

      $ sudo -i
      # apt install -y postgresql
      # systemctl disable --now systemd-timesyncd
      # export MAAS_DBUSER=maasuser
      # export MAAS_DBPASS=maaspass
      # export MAAS_DBNAME=maas
      # sudo -i -u postgres psql -c "CREATE USER \"$MAAS_DBUSER\" WITH ENCRYPTED PASSWORD '$MAAS_DBPASS'"
      # sudo -i -u postgres createdb -O "$MAAS_DBUSER" "$MAAS_DBNAME"

    6. Install MaaS:

      MaaS Console

      # snap install maas

    7. Initialize MaaS:

      MaaS Console

      # maas init region+rack --maas-url http://10.0.110.252:5240/MAAS --database-uri "postgres://$MAAS_DBUSER:$MAAS_DBPASS@localhost/$MAAS_DBNAME"

    8. Create an admin account:

      MaaS Console

      # maas createadmin --username admin --password admin --email admin@example.com

    9. Save the admin API key:

      MaaS Console

      # maas apikey --username admin > admin-apikey

    10. Log in to the MaaS server:

      MaaS Console

      # maas login admin http://localhost:5240/MAAS "$(cat admin-apikey)"

    11. Configure MaaS (Substitute <Trusted_LAN_NTP_IP> and <Trusted_LAN_DNS_IP> with the IP addresses in your environment):

      MaaS Console

      # maas admin domain update maas name="dpf.rdg.local.domain"
      # maas admin maas set-config name=ntp_servers value="<Trusted_LAN_NTP_IP>"
      # maas admin maas set-config name=network_discovery value="disabled"
      # maas admin maas set-config name=upstream_dns value="<Trusted_LAN_DNS_IP>"
      # maas admin maas set-config name=dnssec_validation value="no"
      # maas admin maas set-config name=default_osystem value="ubuntu"

    12. Define and configure IP ranges and subnets:

      MaaS Console

      # maas admin ipranges create type=dynamic start_ip="10.0.110.51" end_ip="10.0.110.120"
      # maas admin ipranges create type=dynamic start_ip="10.0.110.201" end_ip="10.0.110.240"
      # maas admin ipranges create type=reserved start_ip="10.0.110.10" end_ip="10.0.110.10" comment="c-plane VIP"
      # maas admin ipranges create type=reserved start_ip="10.0.110.200" end_ip="10.0.110.200" comment="kamaji VIP"
      # maas admin ipranges create type=reserved start_ip="10.0.110.251" end_ip="10.0.110.254" comment="dpfmgmt"
      # maas admin vlan update 0 untagged dhcp_on=True primary_rack=maas mtu=9000
      # maas admin dnsresources create fqdn=kube-vip.dpf.rdg.local.domain ip_addresses=10.0.110.10
      # maas admin dnsresources create fqdn=jump.dpf.rdg.local.domain ip_addresses=10.0.110.253
      # maas admin dnsresources create fqdn=fw.dpf.rdg.local.domain ip_addresses=10.0.110.254
      # maas admin fabrics create
      Success.
      Machine-readable output follows:
      {
          "class_type": null,
          "name": "fabric-1",
          "id": 1,
      ...
      # maas admin subnets create name="fake-dpf" cidr="20.20.20.0/24" fabric=1
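
Before the ranges are created in MaaS, it may help to confirm that they all sit inside the 10.0.110.0/24 management subnet and do not overlap each other. The sketch below performs that check with Python's standard ipaddress module. The range values are copied from the commands above, while the labels "dynamic-1" and "dynamic-2" and the helper names are illustrative assumptions.

```python
import ipaddress
from itertools import combinations

MGMT_SUBNET = ipaddress.ip_network("10.0.110.0/24")

# (label, first address, last address), copied from the ipranges commands in this guide.
RANGES = [
    ("dynamic-1", "10.0.110.51", "10.0.110.120"),
    ("dynamic-2", "10.0.110.201", "10.0.110.240"),
    ("c-plane VIP", "10.0.110.10", "10.0.110.10"),
    ("kamaji VIP", "10.0.110.200", "10.0.110.200"),
    ("dpfmgmt", "10.0.110.251", "10.0.110.254"),
]


def outside_subnet(ranges, subnet):
    """Return labels of ranges that are reversed or have an endpoint outside the subnet."""
    bad = []
    for label, start, end in ranges:
        first, last = ipaddress.ip_address(start), ipaddress.ip_address(end)
        if first > last or first not in subnet or last not in subnet:
            bad.append(label)
    return bad


def overlapping(ranges):
    """Return pairs of labels whose inclusive address spans intersect."""
    pairs = []
    for (la, sa, ea), (lb, sb, eb) in combinations(ranges, 2):
        a0, a1 = int(ipaddress.ip_address(sa)), int(ipaddress.ip_address(ea))
        b0, b1 = int(ipaddress.ip_address(sb)), int(ipaddress.ip_address(eb))
        if a0 <= b1 and b0 <= a1:
            pairs.append((la, lb))
    return pairs


print(outside_subnet(RANGES, MGMT_SUBNET))  # prints []
print(overlapping(RANGES))                  # prints []
```

Both lists are empty for the plan used in this guide. If you adapt the addressing to your own environment, a non-empty result points to the ranges to revisit before running the maas commands.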

    13. Complete MaaS setup:

      1. Connect to the Jump node GUI and access the MaaS UI at http://10.0.110.252:5240/MAAS.
      2. On the first page, verify the "Region Name" and "DNS Forwarder," then continue.
      3. On the image selection page, select Ubuntu 24.04 LTS (amd64) and sync the image.

        [Screenshot: MaaS image selection]

      4. Import the previously generated SSH key (id_rsa.pub) for the depuser into the MaaS admin user profile and finalize the setup.

        [Screenshot: SSH key import in MaaS]

    14. Configure DHCP snippets:

      1. Navigate to Settings → DHCP Snippets → Add Snippet.
      2. Fill in the following fields:

        1. Name: dpu-bmc-oob-mgmt
        2. Toggle on "Enabled"
        3. Type: IP Range
        4. Applies to: 10.0.110.201-10.0.110.240
      3. Fill in the content of the DHCP snippet field with the following (replace the MAC addresses with the BMC and OOB interface MAC addresses of your DPU workers):

        DHCP snippet

        # dpuworker1
        host dpuworker1-bmc {
        #
        # Node DHCP snippets
        #
          hardware ethernet 58:a2:e1:73:6a:0b;
          fixed-address 10.0.110.201;
        }
        host dpuworker1-oob{
        #
        # Node DHCP snippets
        #
          hardware ethernet 58:a2:e1:73:6a:0a;
          fixed-address 10.0.110.221;
        }
        # dpuworker2
        host dpuworker2-bmc {
        #
        # Node DHCP snippets
        #
          hardware ethernet 58:a2:e1:73:6a:7d;
          fixed-address 10.0.110.202;
        }
        host dpuworker2-oob{
        #
        # Node DHCP snippets
        #
          hardware ethernet 58:a2:e1:73:6a:7c;
          fixed-address 10.0.110.222;
        }
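
With more than a couple of workers, typing host declarations by hand becomes error prone. The following sketch renders declarations in the same form as the snippet above from a small inventory dictionary. The HOSTS mapping and the function names are hypothetical, and the MAC and IP values are the example ones from this guide, so replace them with your own.

```python
# Hypothetical inventory: host label -> (MAC address, fixed IP address).
# The values below are the example ones used in this guide.
HOSTS = {
    "dpuworker1-bmc": ("58:a2:e1:73:6a:0b", "10.0.110.201"),
    "dpuworker1-oob": ("58:a2:e1:73:6a:0a", "10.0.110.221"),
    "dpuworker2-bmc": ("58:a2:e1:73:6a:7d", "10.0.110.202"),
    "dpuworker2-oob": ("58:a2:e1:73:6a:7c", "10.0.110.222"),
}


def render_host(name, mac, ip):
    """Render one host declaration with a fixed address for the given MAC."""
    return (
        f"host {name} {{\n"
        f"  hardware ethernet {mac.lower()};\n"
        f"  fixed-address {ip};\n"
        f"}}"
    )


def render_snippet(hosts):
    """Render all host declarations, one after another, in inventory order."""
    return "\n".join(render_host(name, mac, ip) for name, (mac, ip) in hosts.items())


print(render_snippet(HOSTS))
```

The printed text can be pasted into the DHCP snippet field in place of the hand-written declarations.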

    15. Go to Settings → Deploy, set "Default OS release" to Ubuntu 24.04 LTS Noble Numbat, and save.

      [Screenshot: MaaS default OS release setting]

    16. Update the DNS nameserver IP address in the Netplan files for both the Jump and MaaS VMs from 10.0.110.254 to 10.0.110.252, then reapply the configuration.

    K8s Master VMs

    Suggested specifications:

    • vCPU: 8
    • RAM: 16GB
    • Storage: 100GB
    • Network interface: Bridge device, connected to mgmt-br
    1. Before provisioning the Kubernetes (K8s) Master VMs with MaaS, create the required virtual disks with empty storage. Use the following one-liner to create three 100 GB QCOW2 virtual disks:

      Hypervisor Console

      $ for i in $(seq 1 3); do qemu-img create -f qcow2 /var/lib/libvirt/images/master$i.qcow2 100G; done

      This command generates the following disks in the /var/lib/libvirt/images/ directory:

      • master1.qcow2
      • master2.qcow2
      • master3.qcow2
    2. Configure VMs in virt-manager:

      1. Open virt-manager and create three virtual machines:

        • Assign the corresponding virtual disk (master1.qcow2, master2.qcow2, or master3.qcow2) to each VM.
        • Configure each VM with the suggested specifications (vCPU, RAM, storage, and network interface).
      2. During the VM setup, ensure the NIC is selected under the Boot Options tab. This ensures the VMs can PXE boot for MaaS provisioning.
      3. Once the configuration is complete, shut down all the VMs.
    3. After the VMs are created and configured, proceed to provision them via the MaaS interface. MaaS will handle the OS installation and further setup as part of the deployment process.

    Provision Master VMs Using MaaS

    Install virsh and Set Up SSH Access

    1. SSH to the MaaS VM from the Jump node:

      MaaS Console

      depuser@jump:~$ ssh maas
      depuser@maas:~$ sudo -i

    2. Install the virsh client to communicate with the hypervisor:

      MaaS Console

      # apt install -y libvirt-clients

    3. Generate an SSH key for the root user and copy it to the hypervisor user in the libvirtd group:

      MaaS Console

      # ssh-keygen -t rsa
      # ssh-copy-id ubuntu@<hypervisor_MGMT_IP>

    4. Verify SSH access and virsh communication with the hypervisor:

      MaaS Console

      # virsh -c qemu+ssh://ubuntu@<hypervisor_MGMT_IP>/system list --all

      Expected output:

      MaaS Console

       Id   Name      State
      ------------------------------
       1    fw        running
       2    jump      running
       3    maas      running
       -    master1   shut off
       -    master2   shut off
       -    master3   shut off

    5. Copy the SSH key to the required MaaS directory (for snap-based installations):

      MaaS Console

      # mkdir -p /var/snap/maas/current/root/.ssh
      # cp .ssh/id_rsa* /var/snap/maas/current/root/.ssh/

    Get MAC Addresses of the Master VMs


    Retrieve the MAC addresses of the Master VMs:

    MaaS Console

    # for i in $(seq 1 3); do virsh -c qemu+ssh://ubuntu@<hypervisor_MGMT_IP>/system dumpxml master$i | grep 'mac address'; done

    Example output:

    MaaS Console

    <mac address='52:54:00:a9:9c:ef'/>
    <mac address='52:54:00:19:6b:4d'/>
    <mac address='52:54:00:68:39:7f'/>

    Add Master VMs to MaaS

    1. Add the Master VMs to MaaS:

      Info

      Once the VMs are added, MaaS automatically starts commissioning them (discovery and introspection).

      MaaS Console

      # maas admin machines create hostname=master1 architecture=amd64/generic mac_addresses='52:54:00:a9:9c:ef' power_type=virsh power_parameters_power_address=qemu+ssh://ubuntu@<hypervisor_MGMT_IP>/system power_parameters_power_id=master1 skip_bmc_config=1 testing_scripts=none
      Success.
      Machine-readable output follows:
      {
          "description": "",
          "status_name": "Commissioning",
      ...
          "status": 1,
      ...
          "system_id": "c3seyq",
      ...
          "fqdn": "master1.dpf.rdg.local.domain",
          "power_type": "virsh",
      ...
          "status_message": "Commissioning",
          "resource_uri": "/MAAS/api/2.0/machines/c3seyq/"
      }

      # maas admin machines create hostname=master2 architecture=amd64/generic mac_addresses='52:54:00:19:6b:4d' power_type=virsh power_parameters_power_address=qemu+ssh://ubuntu@<hypervisor_MGMT_IP>/system power_parameters_power_id=master2 skip_bmc_config=1 testing_scripts=none

      # maas admin machines create hostname=master3 architecture=amd64/generic mac_addresses='52:54:00:68:39:7f' power_type=virsh power_parameters_power_address=qemu+ssh://ubuntu@<hypervisor_MGMT_IP>/system power_parameters_power_id=master3 skip_bmc_config=1 testing_scripts=none

    2. Repeat the command for master2 and master3 with their respective MAC addresses.
    3. Verify commissioning by waiting for the status to change to "Ready" in MaaS.

      maas_masters_commission_virsh_updated-version-1-modificationdate-1748173981013-api-v2.png

      After commissioning, the next phase is deployment (OS provisioning).
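The three registration commands above differ only in hostname and MAC address, so they can be generated rather than typed. The following is a minimal sketch, not part of the original procedure: the helper names `maas_create_cmd` and `print_master_cmds` are hypothetical, and the commands are only printed so they can be reviewed before being run on the MaaS console.

```shell
#!/usr/bin/env bash
# Sketch: print one "maas admin machines create" command per Master VM.
# Helper names are hypothetical; the command arguments mirror the ones used above.

maas_create_cmd() {
    # $1 = hostname, $2 = MAC address, $3 = hypervisor management IP
    printf '%s\n' "maas admin machines create hostname=$1 architecture=amd64/generic mac_addresses='$2' power_type=virsh power_parameters_power_address=qemu+ssh://ubuntu@$3/system power_parameters_power_id=$1 skip_bmc_config=1 testing_scripts=none"
}

print_master_cmds() {
    # $1 = hypervisor management IP; reads "hostname mac" pairs from stdin
    while read -r host mac; do
        if [ -n "$host" ]; then
            maas_create_cmd "$host" "$mac" "$1"
        fi
    done
}

# Example (replace the IP placeholder with your hypervisor address):
# print_master_cmds "<hypervisor_MGMT_IP>" <<'EOF'
# master1 52:54:00:a9:9c:ef
# master2 52:54:00:19:6b:4d
# master3 52:54:00:68:39:7f
# EOF
```

Printing instead of executing keeps the sketch safe to try; once the output looks right, it can be pasted into the MaaS console.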

    Configure Master VMs Network


    To ensure persistence across reboots, assign a static IP address to the management interface of the master nodes.

    For each Master VM:

    1. Navigate to Network and click "actions" near the management interface (a small arrowhead pointing down), then select "Edit Physical".

      1. Configure as follows:

        1. Subnet: 10.0.110.0/24
        2. IP Mode: Static Assign
        3. Address: Assign 10.0.110.1 for master1, 10.0.110.2 for master2, and 10.0.110.3 for master3.

          image-2025-5-5_22-22-37-version-1-modificationdate-1748173966197-api-v2.png

    2. Save the interface settings for each VM.

    Deploy Master VMs Using Cloud-Init

    1. Use the following cloud-init script to install the required software and make the configuration persistent:

      Master nodes cloud-init

      #cloud-config
      system_info:
        default_user:
          name: depuser
          passwd: "$6$jOKPZPHD9XbG72lJ$evCabLvy1GEZ5OR1Rrece3NhWpZ2CnS0E3fu5P1VcZgcRO37e4es9gmriyh14b8Jx8gmGwHAJxs3ZEjB0s0kn/"
          lock_passwd: false
          groups: [adm, audio, cdrom, dialout, dip, floppy, lxd, netdev, plugdev, sudo, video]
          sudo: ["ALL=(ALL) NOPASSWD:ALL"]
          shell: /bin/bash
      ssh_pwauth: True
      package_upgrade: true
      runcmd:
        - apt-get update
        - apt-get -y install nfs-common

    2. Deploy the master VMs:

      1. Select all three Master VMs → Actions → Deploy.
      2. Toggle Cloud-init user-data and paste the cloud-init script.
      3. Start the deployment and wait for the status to change to "Ubuntu 24.04 LTS".

        maas_master_vms_deployment_before-version-1-modificationdate-1748173973947-api-v2.png

        image-2025-5-5_22-24-35-version-1-modificationdate-1748173965903-api-v2.png
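The `passwd` field in the cloud-init script holds a SHA-512 crypt hash (the `$6$` prefix), not a plain-text password. If you prefer to set your own password for `depuser`, a hash can be produced as sketched below. This assumes OpenSSL 1.1.1 or later is available on the machine where you prepare the script; the helper name `hash_password` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: generate a SHA-512 crypt hash suitable for the cloud-init "passwd" field.
# Assumes the openssl CLI (1.1.1 or later) is installed.

hash_password() {
    # $1 = plain-text password; prints "$6$<salt>$<hash>" with a random salt.
    # The password is passed on stdin so it does not appear in the process list.
    printf '%s\n' "$1" | openssl passwd -6 -stdin
}

# Example:
# hash_password 'choose-your-own-password'
```

Because the salt is random, each run prints a different hash for the same password; any of them is valid.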

    Verify Deployment

    • SSH into the Master VMs from the Jump node:

      Jump Node Console

      depuser@jump:~$ ssh master1
      depuser@master1:~$

    • Run sudo without a password:

      Master1 Console

      depuser@master1:~$ sudo -i
      root@master1:~#

    • Verify installed packages:

      Master1 Console

      root@master1:~# apt list --installed | egrep 'nfs-common'
      nfs-common/noble,now 1:2.6.4-3ubuntu5 amd64 [installed]

    • Reboot the Master VMs to complete the provisioning.

    Master1 Console

    root@master1:~# reboot

    Repeat the verification commands for master2 and master3.
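The per-node checks can also be run in one pass from the Jump node. The sketch below is an illustration under stated assumptions: the helper name `check_masters` is hypothetical, key-based SSH from the Jump node to the masters is assumed to work, and the command used to reach a node is a parameter so the logic can be tried without touching the nodes.

```shell
#!/usr/bin/env bash
# Sketch: verify passwordless sudo and the nfs-common package on every master.
# check_masters is a hypothetical helper, not part of the original guide.

check_masters() {
    # $1 = command used to reach a node (defaults to ssh)
    runner="${1:-ssh}"
    for host in master1 master2 master3; do
        # sudo -n fails instead of prompting; dpkg -s fails if the package is absent
        if ! "$runner" "$host" 'sudo -n true && dpkg -s nfs-common >/dev/null 2>&1'; then
            echo "FAILED: $host"
            return 1
        fi
    done
    echo "all masters verified"
}

# Example, run from the Jump node:
# check_masters
```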

    K8s Cluster Deployment and Configuration

    Kubespray Deployment and Configuration

    In this solution, the Kubernetes (K8s) cluster is deployed using a modified Kubespray (based on tag v2.26.0) with a non-root depuser account from the Jump Node. The modifications in Kubespray are designed to meet the DPF prerequisites as described in the User Manual and facilitate cluster deployment and scaling.

    Our modified Kubespray installs Flannel CNI for the primary Kubernetes network.

    1. Download the modified Kubespray archive: modified_kubespray_v2.26.0.tar.gz.
    2. Extract the contents and navigate to the extracted directory:

      Jump Node Console

      $ tar -xzf /home/depuser/modified_kubespray_v2.26.0.tar.gz
      $ cd kubespray/
      depuser@jump:~/kubespray$

    3. Verify that the network plugin is set to flannel and that kube_proxy_remove is set to false in the inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml file.

      inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

      [depuser@jump kubespray-2.26.0]$ vim inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

      # Choose network plugin (cilium, calico, kube-ovn, weave or flannel. Use cni for generic cni plugin)
      # Can also be set to 'cloud', which lets the cloud provider setup appropriate routing
      kube_network_plugin: flannel
      ....
      # Kube-proxy proxyMode configuration.
      # Can be ipvs, iptables
      kube_proxy_remove: false
      kube_proxy_mode: ipvs
      .....

    4. Set the K8s API VIP address and DNS record. Replace it with your own IP address and DNS record if different:

      Jump Node Console

      depuser@jump:~/kubespray$ sed -i '/ #kube_vip_address:/s/.*/kube_vip_address: 10.0.110.10/' inventory/mycluster/group_vars/k8s_cluster/addons.yml
      depuser@jump:~/kubespray$ sed -i '/apiserver_loadbalancer_domain_name:/s/.*/apiserver_loadbalancer_domain_name: "kube-vip.dpf.rdg.local.domain"/' roles/kubespray-defaults/defaults/main/main.yml

    5. Install the necessary dependencies and set up the Python virtual environment:

      Jump Node Console

      depuser@jump:~/kubespray$ sudo apt -y install python3-pip jq python3.12-venv
      depuser@jump:~/kubespray$ python3 -m venv .venv
      depuser@jump:~/kubespray$ source .venv/bin/activate
      (.venv) depuser@jump:~/kubespray$ python3 -m pip install --upgrade pip
      (.venv) depuser@jump:~/kubespray$ pip install -U -r requirements.txt
      (.venv) depuser@jump:~/kubespray$ pip install ruamel-yaml

    6. Review and edit the inventory/mycluster/hosts.yaml file to define the cluster nodes. The following is the configuration for this deployment:

      inventory/mycluster/hosts.yaml

      all:
        hosts:
          master1:
            ansible_host: 10.0.110.1
            ip: 10.0.110.1
            access_ip: 10.0.110.1
            node_labels:
              "k8s.ovn.org/zone-name": "master1"
          master2:
            ansible_host: 10.0.110.2
            ip: 10.0.110.2
            access_ip: 10.0.110.2
            node_labels:
              "k8s.ovn.org/zone-name": "master2"
          master3:
            ansible_host: 10.0.110.3
            ip: 10.0.110.3
            access_ip: 10.0.110.3
            node_labels:
              "k8s.ovn.org/zone-name": "master3"

        children:
          kube_control_plane:
            hosts:
              master1:
              master2:
              master3:
          kube_node:
            hosts:
          etcd:
            hosts:
              master1:
              master2:
              master3:
          k8s_cluster:
            children:
              kube_control_plane:
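The settings called out in step 3 can be checked without opening an editor. The following is a minimal sketch; the helper name `check_setting` is hypothetical, and it assumes each key appears as an unindented `key: value` line, as in the excerpt of k8s-cluster.yml shown in that step.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a top-level "key: value" line in a YAML file has the expected value.
# check_setting is a hypothetical helper; it does not parse YAML, it only matches lines.

check_setting() {
    # $1 = file, $2 = key, $3 = expected value
    actual=$(sed -n "s/^$2:[[:space:]]*//p" "$1" | head -n 1)
    if [ "$actual" = "$3" ]; then
        echo "OK: $2=$actual"
    else
        echo "MISMATCH: $2=$actual (expected $3)"
        return 1
    fi
}

# Example, run from the kubespray directory:
# f=inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
# check_setting "$f" kube_network_plugin flannel
# check_setting "$f" kube_proxy_remove false
```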

    Deploying Cluster Using Kubespray Ansible Playbook

    1. Run the following command from the Jump Node to initiate the deployment process:

      Note

      Ensure you are in the Python virtual environment (.venv) when running the command.

      Jump Node Console

      (.venv) depuser@jump:~/kubespray$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

    2. It takes a while for this deployment to complete. Make sure there are no errors. Successful result example:

      image-2025-4-8_22-10-48-version-1-modificationdate-1748173996407-api-v2.png

      Tip

      Keep the shell from which Kubespray was run open. It will be useful later, when scaling out the cluster to add the worker nodes.
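Since the playbook output is long, errors are easy to miss. One option is to capture the output in a file and inspect the final PLAY RECAP, where Ansible prints per-host counters such as `failed=` and `unreachable=`. The sketch below assumes the run was captured with `tee` into a file named kubespray.log (a hypothetical name); `recap_clean` is likewise a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: check an Ansible log for failed or unreachable hosts in the PLAY RECAP.
# Assumes the playbook output was saved, for example:
#   ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml | tee kubespray.log

recap_clean() {
    # $1 = log file. Succeeds only if a recap exists and no host has failures.
    if ! grep -q 'PLAY RECAP' "$1"; then
        echo "no PLAY RECAP found"
        return 1
    fi
    # Print offending recap lines, if any, and report failure
    if grep -E '(unreachable|failed)=[1-9]' "$1"; then
        return 1
    fi
    echo "no failed or unreachable hosts"
}

# Example:
# recap_clean kubespray.log
```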

    K8s Deployment Verification

    To simplify managing the K8s cluster from the Jump Host, set up kubectl with bash auto-completion.

    1. Copy kubectl and the kubeconfig file from master1 to the Jump Host:

      Jump Node Console

      ## Connect to master1
      depuser@jump:~$ ssh master1
      depuser@master1:~$ cp /usr/local/bin/kubectl /tmp/
      depuser@master1:~$ sudo cp /root/.kube/config /tmp/kube-config
      depuser@master1:~$ sudo chmod 644 /tmp/kube-config

    2. In another terminal tab, copy the files to the Jump Host:

      Jump Node Console

      depuser@jump:~$ scp master1:/tmp/kubectl /tmp/
      depuser@jump:~$ sudo chown root:root /tmp/kubectl
      depuser@jump:~$ sudo mv /tmp/kubectl /usr/local/bin/
      depuser@jump:~$ mkdir -p ~/.kube
      depuser@jump:~$ scp master1:/tmp/kube-config ~/.kube/config
      depuser@jump:~$ chmod 600 ~/.kube/config

    3. Enable bash auto-completion for kubectl:

      1. Verify if bash-completion is installed:

        Jump Node Console

        depuser@jump:~$ type _init_completion

        If installed, the output includes:

        Jump Node Console

        _init_completion is a function

      2. If not installed, install it:

        Jump Node Console

        depuser@jump:~$ sudo apt install -y bash-completion

      3. Set up the kubectl completion script:

        Jump Node Console

        depuser@jump:~$ kubectl completion bash | sudo tee /etc/bash_completion.d/kubectl > /dev/null
        depuser@jump:~$ bash

    4. Check the status of the nodes in the cluster:

      Jump Node Console

      depuser@jump:~$ kubectl get nodes

      Expected output:

      Jump Node Console

      NAME      STATUS   ROLES           AGE     VERSION
      master1   Ready    control-plane   8m7s    v1.30.4
      master2   Ready    control-plane   7m13s   v1.30.4
      master3   Ready    control-plane   6m40s   v1.30.4

    5. Check the pods in all namespaces:

      Jump Node Console

      depuser@jump:~$ kubectl get pods -A

      Expected output:

      Jump Node Console

      NAMESPACE     NAME                              READY   STATUS    RESTARTS   AGE
      kube-system   coredns-776bb9db5d-2st6b          1/1     Running   0          5m58s
      kube-system   coredns-776bb9db5d-kbklh          1/1     Running   0          5m53s
      kube-system   dns-autoscaler-6ffb84bd6-cp466    1/1     Running   0          5m54s
      kube-system   kube-apiserver-master1            1/1     Running   0          8m35s
      kube-system   kube-apiserver-master2            1/1     Running   0          7m44s
      kube-system   kube-apiserver-master3            1/1     Running   0          7m10s
      kube-system   kube-controller-manager-master1   1/1     Running   1          8m35s
      kube-system   kube-controller-manager-master2   1/1     Running   1          7m44s
      kube-system   kube-controller-manager-master3   1/1     Running   1          7m10s
      kube-system   kube-flannel-8r2dd                1/1     Running   0          6m22s
      kube-system   kube-flannel-sq88x                1/1     Running   0          6m22s
      kube-system   kube-flannel-xf9mn                1/1     Running   0          6m23s
      kube-system   kube-proxy-4v7hn                  1/1     Running   0          8m21s
      kube-system   kube-proxy-6cdjc                  1/1     Running   0          7m14s
      kube-system   kube-proxy-tm2j4                  1/1     Running   0          7m47s
      kube-system   kube-scheduler-master1            1/1     Running   1          8m36s
      kube-system   kube-scheduler-master2            1/1     Running   1          7m45s
      kube-system   kube-scheduler-master3            1/1     Running   1          7m10s
      kube-system   kube-vip-master1                  1/1     Running   0          8m35s
      kube-system   kube-vip-master2                  1/1     Running   0          7m45s
      kube-system   kube-vip-master3                  1/1     Running   0          7m10s
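Instead of reading the node table by eye, the STATUS column can be filtered. This is a small sketch, assuming the column layout of `kubectl get nodes` shown above (name first, status second); the helper name `not_ready_nodes` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list nodes whose STATUS is not exactly "Ready".
# Feed it the output of "kubectl get nodes --no-headers".

not_ready_nodes() {
    # Column 1 is the node name, column 2 the status
    awk '$2 != "Ready" { print $1 }'
}

# Example, run from the Jump node (prints nothing when all nodes are Ready):
# kubectl get nodes --no-headers | not_ready_nodes
```

A node reported as `Ready,SchedulingDisabled` is also listed, since the comparison is an exact match.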

    DPF Installation

    Software Prerequisites and Required Variables

    Start by installing the remaining software prerequisites:

    Jump Node Console

    ## Connect to master1 to copy helm client utility that was installed during kubespray deployment
    depuser@jump:~$ ssh master1
    depuser@master1:~$ cp /usr/local/bin/helm /tmp/

    ## In another tab
    depuser@jump:~$ scp master1:/tmp/helm /tmp/
    depuser@jump:~$ sudo chown root:root /tmp/helm
    depuser@jump:~$ sudo mv /tmp/helm /usr/local/bin/

    ## Verify that envsubst utility is installed
    depuser@jump:~$ which envsubst
    /usr/bin/envsubst

    Proceed to clone the doca-platform Git repository (and make sure to use tag v25.4.0):

    Jump Node Console

    $ git clone https://github.com/NVIDIA/doca-platform.git
    $ cd doca-platform
    $ git checkout v25.4.0

    Change to the directory that contains the hbn-only readme.md. All subsequent commands are run from this directory:

    Jump Node Console

    $ cd docs/public/user-guides/hbn_only/

    Use the following file to define the required variables for the installation:

    Warning

    Replace the values for the variables in the following file with the values that fit your setup. Specifically, pay attention to DPU_P0 and DPUCLUSTER_INTERFACE.

    export_vars.env

    ## IP Address for the Kubernetes API server of the target cluster on which DPF is installed.
    ## This should never include a scheme or a port.
    ## e.g. 10.10.10.10
    export TARGETCLUSTER_API_SERVER_HOST=10.0.110.10

    ## Port for the Kubernetes API server of the target cluster on which DPF is installed.
    export TARGETCLUSTER_API_SERVER_PORT=6443

    ## Virtual IP used by the load balancer for the DPU Cluster. Must be a reserved IP from the management subnet and not allocated by DHCP.
    export DPUCLUSTER_VIP=10.0.110.200

    ## DPU_P0 is the name of the first port of the DPU. This name must be the same on all worker nodes.
    export DPU_P0=ens1f0np0

    ## Interface on which the DPUCluster load balancer will listen. Should be the management interface of the control plane node.
    export DPUCLUSTER_INTERFACE=eno1

    # IP address to the NFS server used as storage for the BFB.
    export NFS_SERVER_IP=10.0.110.253

    ## The repository URL for the NVIDIA Helm chart registry.
    ## Usually this is the NVIDIA Helm NGC registry. For development purposes, this can be set to a different repository.
    export HELM_REGISTRY_REPO_URL=https://helm.ngc.nvidia.com/nvidia/doca

    ## The repository URL for the HBN container image.
    ## Usually this is the NVIDIA NGC registry. For development purposes, this can be set to a different repository.
    export HBN_NGC_IMAGE_URL=nvcr.io/nvidia/doca/doca_hbn

    ## The DPF REGISTRY is the Helm repository URL for the DPF Operator.
    ## Usually this is the GHCR registry. For development purposes, this can be set to a different repository.
    export REGISTRY=https://helm.ngc.nvidia.com/nvidia/doca

    ## The DPF TAG is the version of the DPF components which will be deployed in this guide.
    export TAG=v25.4.0

    ## URL to the BFB used in the `bfb.yaml` and linked by the DPUSet.
    export BLUEFIELD_BITSTREAM="https://content.mellanox.com/BlueField/BFBs/Ubuntu22.04/bf-bundle-3.0.0-135_25.04_ubuntu-22.04_prod.bfb"

    Export environment variables for the installation:

    Jump Node Console

    $ source export_vars.env
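Later steps pipe manifests through envsubst, which silently replaces an unset variable with an empty string. It can therefore help to confirm that every variable from export_vars.env is set in the current shell before continuing. The following is a minimal sketch; `require_vars` is a hypothetical helper, and the list of names in the example is taken from the file above.

```shell
#!/usr/bin/env bash
# Sketch: report which of the named variables are unset or empty.
# require_vars is a hypothetical helper, not part of DPF.

require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible)
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Example, after "source export_vars.env":
# require_vars TARGETCLUSTER_API_SERVER_HOST TARGETCLUSTER_API_SERVER_PORT \
#     DPUCLUSTER_VIP DPU_P0 DPUCLUSTER_INTERFACE NFS_SERVER_IP \
#     HELM_REGISTRY_REPO_URL HBN_NGC_IMAGE_URL REGISTRY TAG BLUEFIELD_BITSTREAM
```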

    DPF Operator Installation

    Cert-manager Installation

    Cert-manager is a powerful and extensible X.509 certificate controller for Kubernetes workloads. It obtains certificates from a variety of Issuers, both popular public Issuers as well as private ones. Cert-manager ensures certificates are valid and up-to-date, and it attempts to renew certificates at a configured time before expiration.

    In this deployment, Cert-manager is a prerequisite that provides certificates for webhooks used by DPF and its dependencies.

    Install Cert-manager using Helm. The following values will be used for the Helm chart installation:

    manifests/01-dpf-operator-installation/helm-values/cert-manager.yml

    startupapicheck:
      enabled: false
    crds:
      enabled: true
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-role.kubernetes.io/master
                  operator: Exists
            - matchExpressions:
                - key: node-role.kubernetes.io/control-plane
                  operator: Exists
    tolerations:
      - operator: Exists
        effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
      - operator: Exists
        effect: NoSchedule
        key: node-role.kubernetes.io/master
    cainjector:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/master
                    operator: Exists
              - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
      tolerations:
        - operator: Exists
          effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
        - operator: Exists
          effect: NoSchedule
          key: node-role.kubernetes.io/master
    webhook:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/master
                    operator: Exists
              - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
      tolerations:
        - operator: Exists
          effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
        - operator: Exists
          effect: NoSchedule
          key: node-role.kubernetes.io/master

    Run the following commands:

    Jump Node Console

    $ helm repo add jetstack https://charts.jetstack.io --force-update
    $ helm upgrade --install --create-namespace --namespace cert-manager cert-manager jetstack/cert-manager --version v1.16.1 -f ./manifests/01-dpf-operator-installation/helm-values/cert-manager.yml

    Release "cert-manager" does not exist. Installing it now.
    NAME: cert-manager
    LAST DEPLOYED: Tue Apr 8 13:40:48 2025
    NAMESPACE: cert-manager
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    NOTES:
    cert-manager v1.16.1 has been deployed successfully!
    ...

    Verify that all the pods in the Cert-manager namespace are in the Ready state:

    Jump Node Console

    $ kubectl wait --for=condition=ready --namespace cert-manager pods --all
    pod/cert-manager-6ffdf6c5f8-5sx4q condition met
    pod/cert-manager-cainjector-66b8577665-rgrlz condition met
    pod/cert-manager-webhook-5cb94cb7b6-c7lpz condition met

    Install a CSI to Back the DPUCluster etcd

    Download the local-path-provisioner Helm chart to your current working directory and create a namespace for it:

    Jump Node Console

    $ curl https://codeload.github.com/rancher/local-path-provisioner/tar.gz/v0.0.30 | tar -xz --strip=3 local-path-provisioner-0.0.30/deploy/chart/local-path-provisioner/
    $ kubectl create ns local-path-provisioner

    The following values will be used for the installation:

    manifests/01-dpf-operator-installation/helm-values/local-path-provisioner.yml

    tolerations:
      - operator: Exists
        effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
      - operator: Exists
        effect: NoSchedule
        key: node-role.kubernetes.io/master

    Run the following command:

    Jump Node Console

    $ helm install -n local-path-provisioner local-path-provisioner ./local-path-provisioner --version 0.0.30 -f ./manifests/01-dpf-operator-installation/helm-values/local-path-provisioner.yml

    NAME: local-path-provisioner
    LAST DEPLOYED: Tue Apr 8 13:43:06 2025
    NAMESPACE: local-path-provisioner
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    NOTES:
    ...

    Ensure that the pod in the local-path-provisioner namespace is in the Ready state:

    Jump Node Console

    $ kubectl wait --for=condition=ready --namespace local-path-provisioner pods --all
    pod/local-path-provisioner-75f649c47c-rsvb8 condition met

    Create Storage Required by the DPF Operator

    The following YAML file defines the storage (for the BFB images) that is required by the DPF Operator.

    manifests/01-dpf-operator-installation/nfs-storage-for-bfb-dpf-ga.yaml

    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: bfb-pv
    spec:
      capacity:
        storage: 10Gi
      volumeMode: Filesystem
      accessModes:
        - ReadWriteMany
      nfs:
        path: /mnt/dpf_share/bfb
        server: $NFS_SERVER_IP
      persistentVolumeReclaimPolicy: Delete
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: bfb-pvc
      namespace: dpf-operator-system
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
      volumeMode: Filesystem
      storageClassName: ""

    Run the following commands to first create the namespace for the DPF Operator, then substitute the environment variables using envsubst, and apply the YAML files:

    Jump Node Console

    $ kubectl create ns dpf-operator-system
    $ cat manifests/01-dpf-operator-installation/*.yaml | envsubst | kubectl apply -f -
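To see which variables a manifest expects before piping it through envsubst, the placeholders can be listed first. The sketch below is illustrative: `list_placeholders` is a hypothetical helper that simply searches for `$NAME` and `${NAME}` patterns, so it may also report a dollar sign that is not meant as a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: print each $VARIABLE or ${VARIABLE} name referenced in text read from stdin.
# list_placeholders is a hypothetical helper based on pattern matching only.

list_placeholders() {
    grep -oE '\$\{?[A-Za-z_][A-Za-z0-9_]*\}?' | tr -d '${}' | sort -u
}

# Example:
# cat manifests/01-dpf-operator-installation/*.yaml | list_placeholders
```

For the storage manifest shown above, the only name reported would be NFS_SERVER_IP, which must be exported before the apply step.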

    DPF Operator Deployment

    The DPF Operator Helm values are detailed in the following YAML file:

    manifests/01-dpf-operator-installation/helm-values/dpf-operator.yml

    kamaji-etcd:
      persistentVolumeClaim:
        storageClassName: local-path
    node-feature-discovery:
      worker:
        extraEnvs:
          - name: "KUBERNETES_SERVICE_HOST"
            value: "$TARGETCLUSTER_API_SERVER_HOST"
          - name: "KUBERNETES_SERVICE_PORT"
            value: "$TARGETCLUSTER_API_SERVER_PORT"

    Run the following commands to substitute the environment variables and install the DPF Operator:

    Jump Node Console

    $ helm repo add --force-update dpf-repository ${REGISTRY}
    $ helm repo update
    $ envsubst < ./manifests/01-dpf-operator-installation/helm-values/dpf-operator.yml | helm upgrade --install -n dpf-operator-system dpf-operator dpf-repository/dpf-operator --version=$TAG --values -

    Release "dpf-operator" does not exist. Installing it now.
    coalesce.go:286: warning: cannot overwrite table with non table for dpf-operator.parca.server.tolerations (map[])
    NAME: dpf-operator
    LAST DEPLOYED: Tue May 20 23:18:22 2025
    NAMESPACE: dpf-operator-system
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None

    Verify the DPF Operator installation by ensuring the deployment is available and all the pods are ready:

    Note

    The following verification commands may need to be run multiple times to ensure the conditions are met.

    Jump Node Console

    $ kubectl rollout status deployment --namespace dpf-operator-system dpf-operator-controller-manager
    deployment "dpf-operator-controller-manager" successfully rolled out

    $ kubectl wait --for=condition=ready --namespace dpf-operator-system pods --all
    pod/dpf-operator-argocd-application-controller-0 condition met
    pod/dpf-operator-argocd-redis-5bc74d76fc-v6l7m condition met
    pod/dpf-operator-argocd-repo-server-86c9454fc9-zqtqf condition met
    pod/dpf-operator-argocd-server-554d9f446-lntpv condition met
    pod/dpf-operator-controller-manager-67599cdcb7-5dchf condition met
    pod/dpf-operator-kamaji-6dcf4ccdfd-fg64w condition met
    pod/dpf-operator-kamaji-etcd-0 condition met
    pod/dpf-operator-kamaji-etcd-1 condition met
    pod/dpf-operator-kamaji-etcd-2 condition met
    pod/dpf-operator-maintenance-operator-666b88bfcd-p72nn condition met
    pod/dpf-operator-node-feature-discovery-gc-656b95dc48-gwtsb condition met
    pod/dpf-operator-node-feature-discovery-master-76d5695c7c-6kwfz condition met
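Because the verification commands may need several attempts, they can be wrapped in a small retry loop rather than re-run by hand. This is a sketch under stated assumptions: `retry` is a hypothetical helper, and the attempt count and delay in the example are arbitrary values, not recommendations from the guide.

```shell
#!/usr/bin/env bash
# Sketch: re-run a command until it succeeds or the attempt limit is reached.
# retry is a hypothetical helper, not part of kubectl or DPF.

retry() {
    # $1 = maximum attempts, $2 = seconds between attempts, rest = command to run
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example:
# retry 10 30 kubectl wait --for=condition=ready --namespace dpf-operator-system pods --all --timeout=60s
```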

    DPF System Installation

    This section involves creating the DPF system components and some basic infrastructure required for a functioning DPF-enabled cluster.

    The files define the DPFOperatorConfig to install the DPF System components, and the DPUCluster to serve as the Kubernetes control plane for DPU nodes.

    manifests/02-dpf-system-installation/operatorconfig.yaml

    ---
    apiVersion: operator.dpu.nvidia.com/v1alpha1
    kind: DPFOperatorConfig
    metadata:
      name: dpfoperatorconfig
      namespace: dpf-operator-system
    spec:
      kamajiClusterManager:
        disable: false
      provisioningController:
        bfbPVCName: bfb-pvc
        installInterface:
          installViaRedfish:
            # set this to the IP of one of your control plane node + 8080
            bfbRegistryAddress: "10.0.110.1:8080"
        dmsTimeout: 900
      staticClusterManager:
        disable: false
      networking:
        controlPlaneMTU: 9216
        highSpeedMTU: 9216

    manifests/02-dpf-system-installation/dpucluster.yaml

    ---
    apiVersion: provisioning.dpu.nvidia.com/v1alpha1
    kind: DPUCluster
    metadata:
      name: dpu-cplane-tenant1
      namespace: dpu-cplane-tenant1
    spec:
      type: kamaji
      maxNodes: 10
      version: v1.30.2
      clusterEndpoint:
        # deploy keepalived instances on the nodes that match the given nodeSelector.
        keepalived:
          # interface on which keepalived will listen. Should be the oob interface of the control plane node.
          interface: $DPUCLUSTER_INTERFACE
          # Virtual IP reserved for the DPU Cluster load balancer. Must not be allocatable by DHCP.
          vip: $DPUCLUSTER_VIP
          # virtualRouterID must be in range [1,255], make sure the given virtualRouterID does not duplicate with any existing keepalived process running on the host
          virtualRouterID: 126
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""

    Create a namespace for the Kubernetes control plane of the DPU nodes:

    Jump Node Console

    $ kubectl create ns dpu-cplane-tenant1

    Apply the previous YAML files:

    Jump Node Console

    $ cat manifests/02-dpf-system-installation/operatorconfig.yaml | envsubst | kubectl apply -f -
    $ cat manifests/02-dpf-system-installation/dpucluster.yaml | envsubst | kubectl apply -f -

    Verify the DPF system by ensuring that the provisioning and DPUService controller manager deployments are available, all other deployments in the DPF Operator system are available, and that the DPUCluster is ready for nodes to join.

    Jump Node Console

    $ kubectl rollout status deployment --namespace dpf-operator-system dpf-provisioning-controller-manager dpuservice-controller-manager
    deployment "dpf-provisioning-controller-manager" successfully rolled out
    deployment "dpuservice-controller-manager" successfully rolled out

    $ kubectl rollout status deployment --namespace dpf-operator-system
    deployment "dpf-operator-argocd-applicationset-controller" successfully rolled out
    deployment "dpf-operator-argocd-redis" successfully rolled out
    deployment "dpf-operator-argocd-repo-server" successfully rolled out
    deployment "dpf-operator-argocd-server" successfully rolled out
    deployment "dpf-operator-controller-manager" successfully rolled out
    deployment "dpf-operator-kamaji" successfully rolled out
    deployment "dpf-operator-maintenance-operator" successfully rolled out
    deployment "dpf-operator-node-feature-discovery-gc" successfully rolled out
    deployment "dpf-operator-node-feature-discovery-master" successfully rolled out
    deployment "dpf-provisioning-controller-manager" successfully rolled out
    deployment "dpuservice-controller-manager" successfully rolled out
    deployment "kamaji-cm-controller-manager" successfully rolled out
    deployment "static-cm-controller-manager" successfully rolled out

    $ kubectl wait --for=condition=ready --namespace dpu-cplane-tenant1 dpucluster --all
    dpucluster.provisioning.dpu.nvidia.com/dpu-cplane-tenant1 condition met

    DPU Provisioning

    Run the following command from the Jump Node console to verify the BMC version (25.04-2 is the recommended BMC firmware version):

    Jump Node Console

    $ curl -k -u root:'3tango11!OBMC' https://10.0.110.201/redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware
    {
      "@odata.id": "/redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware",
      "@odata.type": "#SoftwareInventory.v1_4_0.SoftwareInventory",
      "Description": "BMC image",
      "Id": "BMC_Firmware",
      "Manufacturer": "",
      "Name": "Software Inventory",
      "RelatedItem": [],
      "RelatedItem@odata.count": 0,
      "SoftwareId": "0x0018",
      "Status": {
        "Conditions": [],
        "Health": "OK",
        "HealthRollup": "OK",
        "State": "Enabled"
      },
      "Updateable": true,
      "Version": "BF-23.04",
      "WriteProtected": false
    }
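The version check can be scripted by extracting the `Version` field from the Redfish response and comparing it with the target. The following is a minimal sketch: the helper names `bmc_version` and `needs_bmc_update` are hypothetical, the target string `BF-25.04-2` is taken from the post-update output shown later in this guide, and the comparison is a plain string match, not a version ordering. `sed` is used here to keep the sketch dependency-free; `jq -r .Version` would work equally well where jq is installed.

```shell
#!/usr/bin/env bash
# Sketch: read the BMC firmware version from a Redfish SoftwareInventory response.
# bmc_version and needs_bmc_update are hypothetical helpers.

bmc_version() {
    # Reads the JSON document on stdin; prints the value of the "Version" field
    sed -n 's/.*"Version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

needs_bmc_update() {
    # $1 = reported version, $2 = target version. Succeeds when they differ.
    [ "$1" != "$2" ]
}

# Example (replace the placeholders with your BMC address and root password):
# current=$(curl -sk -u "root:<BMC_ROOT_PASSWORD>" \
#     "https://<BMC_IP>/redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware" | bmc_version)
# if needs_bmc_update "$current" "BF-25.04-2"; then echo "update required ($current)"; fi
```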

    If you have an older BMC version, perform the following steps to update the DPU BMC, UEFI, and firmware:

    1. Download the relevant BFB image.

      Jump Node Console

      $ wget https://content.mellanox.com/BlueField/BFBs/Ubuntu22.04/bf-bundle-3.0.0-135_25.04_ubuntu-22.04_prod.bfb

    2. Create a bf.cfg file.

      Jump Node Console

      $ vim bf.cfg

      BMC_PASSWORD="$(tr -dc 'A-Za-z0-9' </dev/urandom | head -c 4)-$(tr -dc 'A-Za-z0-9' </dev/urandom | head -c 4)_$(tr -dc '0-9' </dev/urandom | head -c 2)$(tr -dc 'a-z' </dev/urandom | head -c 1)$(tr -dc 'A-Z' </dev/urandom | head -c 1)"
      BMC_USER="firmware_updater"
      BMC_REBOOT="yes"
      CEC_REBOOT="yes"
      USER_ID=8

      pre_bmc_components_update()
      {
          ipmitool user set name $USER_ID $BMC_USER
          ipmitool user set password $USER_ID $BMC_PASSWORD
          ipmitool user enable $USER_ID
          ipmitool channel setaccess 1 $USER_ID ipmi=on
          ipmitool user priv $USER_ID 0x4 1
      }

      post_bmc_components_update()
      {
          ipmitool user set name $USER_ID ""
      }

    3. Concatenate the BFB image and the bf.cfg file into a single installation image.

      Jump Node Console

      $ cat bf-bundle-3.0.0-135_25.04_ubuntu-22.04_prod.bfb bf.cfg > bfb-install.bfb

    4. Connect to the DPU BMC over SSH.

      Jump Node Console

      $ ssh root@10.0.110.201
      root@10.0.110.201's password: <BMC root password. The default is root/0penBmc and must be changed at first login>

    5. Start the rshim service.

      Jump Node Console

      root@dpu-bmc:~# systemctl enable rshim
      root@dpu-bmc:~# systemctl start rshim
      root@dpu-bmc:~# systemctl status rshim
      * rshim.service - rshim driver for BlueField SoC
           Loaded: loaded (/usr/lib/systemd/system/rshim.service; enabled; preset: disabled)
           Active: active (running) since Wed 2025-04-23 14:21:43 UTC; 24h ago
             Docs: man:rshim(8)
         Main PID: 940 (rshim)
              CPU: 3h 39min 40.138s
           CGroup: /system.slice/rshim.service
                   `-940 /usr/sbin/rshim

      Apr 23 14:21:42 dpu-bmc (rshim)[908]: rshim.service: Referenced but unset environment variable evaluates to an empty string: OPTIONS
      Apr 23 14:21:42 dpu-bmc rshim[940]: Created PID file: /var/run/rshim.pid
      Apr 23 14:21:43 dpu-bmc rshim[940]: USB device detected
      Apr 23 14:21:47 dpu-bmc rshim[940]: Probing usb-2.1
      Apr 23 14:21:47 dpu-bmc rshim[940]: create rshim usb-2.1
      Apr 23 14:21:48 dpu-bmc rshim[940]: rshim0 attached

      root@dpu-bmc:~# exit
      logout
      Connection to 10.0.110.201 closed.
      [depuser@setup5-jump ~]$

    6. Open an additional console on the Jump node and connect to the DPU OOB to monitor the status of the update process.

      Jump Node and DPU OOB Console

      $ ssh root@10.0.110.201
      root@10.0.110.201's password:
      root@dpu-bmc:~# obmc-console-client

      dpu-device-1 login:

    7. Return to the Jump node console and run the following command to start the BMC, UEFI, and firmware update process.

      Jump Node Console

      $ scp bfb-install.bfb root@10.0.110.201:/dev/rshim0/boot

    8. Return to the DPU OOB console. Wait approximately 20-25 minutes for the update process to finish.

      Jump Node Console

      [13:26:32] No active BMC task
      [13:26:32] Updating BMC firmware
      [13:26:32] Found BMC firmware image: /mnt/lib/firmware/mellanox/bmc/bf3-bmc-fw.fwpkg
      [13:26:32] Provided BMC firmware version: 25.04-2
      [13:26:32] - INFO: BMC_FIRMWARE_URL: /redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware
      [13:26:32] Running BMC firmware version: 23.09-6
      [13:26:32] Proceeding with the BMC firmware update.
      [13:26:33] curl -sSk -u <BMC_USER:BMC_PASSWORD> -H Content-Type: application/octet-stream -X POST -T /mnt/lib/firmware/mellanox/bmc/bf3-bmc-fw.fwpkg https://192.168.240.1/redfish/v1/UpdateService
      [13:26:41] BMC Firmware update: {
        "@odata.id": "/redfish/v1/TaskService/Tasks/0",
        "@odata.type": "#Task.v1_4_3.Task",
        "Id": "0",
        "TaskState": "Running",
        "TaskStatus": "OK"
      }
      [13:26:44] Task id: /redfish/v1/TaskService/Tasks/0
      [13:39:32] INFO: BMC firmware was updated to: 25.04-2
      [13:39:32] BFB-Installer: Installing BMC Image passed, total 64% complete
      [13:39:33] Task id: /redfish/v1/TaskService/Tasks/0
      [13:39:33] Updating CEC firmware
      [13:39:33] Found CEC firmware image: /mnt/lib/firmware/mellanox/cec/bf3-cec-fw.fwpkg
      [13:39:33] Provided CEC firmware version: 00.02.0195.0000
      [13:39:33] Running CEC firmware version: 00.02.0127.0000
      [13:39:33] Proceeding with the CEC firmware update...
      [13:39:33] curl -sSk -u <BMC_USER:BMC_PASSWORD> -H Content-Type: application/octet-stream -X POST -T /mnt/lib/firmware/mellanox/cec/bf3-cec-fw.fwpkg https://192.168.240.1/redfish/v1/UpdateService
      [13:39:35] CEC Firmware update: {
        "@odata.id": "/redfish/v1/TaskService/Tasks/1",
        "@odata.type": "#Task.v1_4_3.Task",
        "Id": "1",
        "TaskState": "Running",
        "TaskStatus": "OK"
      }
      [13:39:38] Task id: /redfish/v1/TaskService/Tasks/1
      [13:39:59] INFO: CEC firmware was updated to 00.02.0195.0000. Host power cycle is required
      [13:39:59] BFB-Installer: Installing Glacier Image passed, total 65% complete
      [13:39:59] Rebooting BMC...
      Connection to 10.0.110.201 closed by remote host.
      Connection to 10.0.110.201 closed.

    9. Power cycle the server that hosts the updated DPU.
    10. Run the following command from the Jump node console to verify the BMC version:

      Jump Node Console


      $ curl -k -u root:'3tango11!OBMC' https://10.0.110.201/redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware
      {
        "@odata.id": "/redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware",
        "@odata.type": "#SoftwareInventory.v1_4_0.SoftwareInventory",
        "Description": "BMC image",
        "Id": "BMC_Firmware",
        "Manufacturer": "",
        "Name": "Software Inventory",
        "RelatedItem": [],
        "RelatedItem@odata.count": 0,
        "SoftwareId": "0x0018",
        "Status": {
          "Conditions": [],
          "Health": "OK",
          "HealthRollup": "OK",
          "State": "Enabled"
        },
        "Updateable": true,
        "Version": "BF-25.04-2",
        "WriteProtected": false
      }

    Repeat steps 4-10 on DPU 2.
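The version check in step 10 can also be scripted so that it is easy to repeat for each DPU. The following is a minimal sketch, not part of the original procedure: the helper names `redfish_version` and `bmc_version_is` and the `BMC_PASSWORD` variable are hypothetical, and the field is extracted with `grep` and `sed` only to avoid extra dependencies (a JSON parser such as `jq` would be more robust).

```shell
#!/usr/bin/env bash
# Extract the "Version" field from a Redfish SoftwareInventory JSON document
# read on stdin. Works on both single-line and pretty-printed JSON.
redfish_version() {
  grep -o '"Version"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*:[[:space:]]*"\([^"]*\)"/\1/'
}

# Hypothetical helper: succeed only if the version found on stdin
# equals the expected version passed as the first argument.
bmc_version_is() {
  local expected="$1" actual
  actual="$(redfish_version)"
  [ "$actual" = "$expected" ]
}

# Example against a live BMC (address and version taken from this guide,
# BMC_PASSWORD is an assumed environment variable):
# curl -sk -u root:"$BMC_PASSWORD" \
#   https://10.0.110.201/redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware \
#   | bmc_version_is "BF-25.04-2" && echo "BMC firmware is up to date"
```

Run the same check against the BMC address of the second DPU after repeating the update there.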

    To authenticate with Redfish, DPF needs the password of the BMC root user. Create a Kubernetes secret that holds it:

    Jump Node Console


    $ kubectl create secret generic -n dpf-operator-system bmc-shared-password --from-literal=password='ROOT_BMC_PASSWORD'

    Create the following YAML to define a DPUDevice:

    manifests/04-dpudeployment-installation/create-dpu-devices.yaml


    ---
    apiVersion: provisioning.dpu.nvidia.com/v1alpha1
    kind: DPUDevice
    metadata:
      name: dpu-device-1
      namespace: dpf-operator-system
    spec:
      bmcIp: 10.0.110.201
    ---
    apiVersion: provisioning.dpu.nvidia.com/v1alpha1
    kind: DPUDevice
    metadata:
      name: dpu-device-2
      namespace: dpf-operator-system
    spec:
      bmcIp: 10.0.110.202

    Run the command to create a DPUDevice:

    Jump Node Console


    $ kubectl apply -f manifests/04-dpudeployment-installation/create-dpu-devices.yaml

    Verify the DPF system by ensuring that the DPUDevices exist:

    Jump Node Console


    $ kubectl get dpudevices -n dpf-operator-system
    NAME           AGE
    dpu-device-1   7s
    dpu-device-2   7s

    Create the following YAML to define a DPUNode:

    manifests/04-dpudeployment-installation/create-dpu-nodes.yaml


    ---
    apiVersion: provisioning.dpu.nvidia.com/v1alpha1
    kind: DPUNode
    metadata:
      name: dpuworker1
      namespace: dpf-operator-system
    spec:
      nodeRebootMethod:
        external: {} # DPU will be rebooted externally (via BMC/IPMI)
      dpus:
        - name: dpu-device-1 # Name of the previously created DPUDevice
    ---
    apiVersion: provisioning.dpu.nvidia.com/v1alpha1
    kind: DPUNode
    metadata:
      name: dpuworker2
      namespace: dpf-operator-system
    spec:
      nodeRebootMethod:
        external: {}
      dpus:
        - name: dpu-device-2

    Run the command to create a DPUNode:

    Jump Node Console


    $ kubectl apply -f manifests/04-dpudeployment-installation/create-dpu-nodes.yaml

    Verify the DPF system by ensuring that the DPUNodes exist:

    Jump Node Console


    $ kubectl get dpunodes -n dpf-operator-system
    NAME         AGE
    dpuworker1   8s
    dpuworker2   8s

    Use the following YAML to define a BFB resource that downloads the BlueField bitstream to a shared volume:

    manifests/04-dpudeployment-installation/bfb.yaml


    ---
    apiVersion: provisioning.dpu.nvidia.com/v1alpha1
    kind: BFB
    metadata:
      name: bf-bundle
      namespace: dpf-operator-system
    spec:
      url: $BLUEFIELD_BITSTREAM

    Run the command to create the BFB:

    Jump Node Console


    $ cat manifests/04-dpudeployment-installation/bfb.yaml | envsubst |kubectl apply -f -
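Because `envsubst` replaces an unset variable with an empty string, a missing `BLUEFIELD_BITSTREAM` would produce a BFB object with an empty `url`. A small guard can catch this before the manifest is applied. The sketch below is illustrative only: `require_env` is a hypothetical helper name, and it relies on bash indirect expansion.

```shell
#!/usr/bin/env bash
# Fail early if any of the named environment variables is unset or empty.
require_env() {
  local name
  for name in "$@"; do
    # ${!name} expands to the value of the variable whose name is in $name.
    if [ -z "${!name:-}" ]; then
      echo "error: environment variable ${name} is not set" >&2
      return 1
    fi
  done
}

# Example: apply the BFB manifest only when the URL variable is present.
# The variable must also be exported so that envsubst can see it.
# require_env BLUEFIELD_BITSTREAM && \
#   envsubst < manifests/04-dpudeployment-installation/bfb.yaml | kubectl apply -f -
```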

    Add labels to DPUNodes. Set the values according to your environment.

    Jump Node Console


    kubectl label dpunodes.provisioning.dpu.nvidia.com -n dpf-operator-system dpuworker1 feature.node.kubernetes.io/dpu-0-pf0-name=ens1f0np0
    kubectl label dpunodes.provisioning.dpu.nvidia.com -n dpf-operator-system dpuworker1 feature.node.kubernetes.io/dpu-0-number-of-pfs=2
    kubectl label dpunodes.provisioning.dpu.nvidia.com -n dpf-operator-system dpuworker1 feature.node.kubernetes.io/dpu-oob-bridge-configured=""
    kubectl label dpunodes.provisioning.dpu.nvidia.com -n dpf-operator-system dpuworker1 feature.node.kubernetes.io/dpu-enabled=true
    kubectl label dpunodes.provisioning.dpu.nvidia.com -n dpf-operator-system dpuworker1 feature.node.kubernetes.io/dpu-0-pci-address=0000-2b-00

    kubectl label dpunodes.provisioning.dpu.nvidia.com -n dpf-operator-system dpuworker2 feature.node.kubernetes.io/dpu-0-pf0-name=ens1f0np0
    kubectl label dpunodes.provisioning.dpu.nvidia.com -n dpf-operator-system dpuworker2 feature.node.kubernetes.io/dpu-0-number-of-pfs=2
    kubectl label dpunodes.provisioning.dpu.nvidia.com -n dpf-operator-system dpuworker2 feature.node.kubernetes.io/dpu-oob-bridge-configured=""
    kubectl label dpunodes.provisioning.dpu.nvidia.com -n dpf-operator-system dpuworker2 feature.node.kubernetes.io/dpu-enabled=true
    kubectl label dpunodes.provisioning.dpu.nvidia.com -n dpf-operator-system dpuworker2 feature.node.kubernetes.io/dpu-0-pci-address=0000-2b-00
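With more than a couple of DPUNodes, the same five labels can be generated in a loop instead of being typed per node. The following sketch assumes the label set shown above; the function name `label_dpunode` and the `DRY_RUN` switch are hypothetical. By default it only prints the commands so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Print (DRY_RUN=1, the default) or run the kubectl label commands for one DPUNode.
# Arguments: <dpunode-name> <pf0-interface-name> <pci-address>
label_dpunode() {
  local node="$1" pf0="$2" pci="$3" label
  local prefix="feature.node.kubernetes.io"
  for label in \
    "dpu-0-pf0-name=${pf0}" \
    "dpu-0-number-of-pfs=2" \
    "dpu-oob-bridge-configured=" \
    "dpu-enabled=true" \
    "dpu-0-pci-address=${pci}"
  do
    local cmd=(kubectl label dpunodes.provisioning.dpu.nvidia.com
               -n dpf-operator-system "$node" "${prefix}/${label}")
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "${cmd[*]}"   # review mode: show the command only
    else
      "${cmd[@]}"        # apply mode: run kubectl
    fi
  done
}

# Values taken from this guide; set them according to your environment.
label_dpunode dpuworker1 ens1f0np0 0000-2b-00
label_dpunode dpuworker2 ens1f0np0 0000-2b-00
```

Setting `DRY_RUN=0` before running the script applies the labels instead of printing them.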

    Create the following YAML to define a DPUSet:

    manifests/04-dpudeployment-installation/create-dpu-set.yaml


    ---
    apiVersion: provisioning.dpu.nvidia.com/v1alpha1
    kind: DPUSet
    metadata:
      name: dpuset
      namespace: dpf-operator-system
    spec:
      strategy:
        rollingUpdate:
          maxUnavailable: "10%"
        type: RollingUpdate
      dpuTemplate:
        spec:
          dpuFlavor: dpf-provisioning-hbn-ovn
          bfb:
            name: bf-bundle
          nodeEffect:
            noEffect: true

    Run the command to create a DPUSet:

    Jump Node Console


    $ kubectl apply -f manifests/04-dpudeployment-installation/create-dpu-set.yaml

    To follow the progress of DPU provisioning, run the following command to check its current phase:

    Jump Node Console


    $ watch -n10 "kubectl describe dpu -n dpf-operator-system | grep 'Node Name\|Type\|Last\|Phase'"
    Every 10.0s: kubectl describe dpu -n dpf-operator-system | grep 'Node Name\|Type\|Last\|Phase'    setup5-jump: Wed May 21 10:45:44 2025

    Dpu Node Name: dpuworker1
    Type: InternalIP
    Type: Hostname
    Last Transition Time: 2025-05-21T07:23:09Z
    Type: Initialized
    Last Transition Time: 2025-05-21T07:23:09Z
    Type: BFBReady
    Last Transition Time: 2025-05-21T07:23:11Z
    Type: NodeEffectReady
    Last Transition Time: 2025-05-21T07:23:15Z
    Type: InterfaceInitialized
    Last Transition Time: 2025-05-21T07:23:17Z
    Type: FWConfigured
    Last Transition Time: 2025-05-21T07:23:18Z
    Type: BFBPrepared
    Last Transition Time: 2025-05-21T07:27:25Z
    Type: OSInstalled
    Last Transition Time: 2025-05-21T07:44:54Z
    Type: Rebooted

    Dpu Node Name: dpuworker2
    Type: InternalIP
    Type: Hostname
    Last Transition Time: 2025-05-21T07:23:08Z
    Type: Initialized
    Last Transition Time: 2025-05-21T07:23:09Z
    Type: BFBReady
    Last Transition Time: 2025-05-21T07:23:09Z
    Type: NodeEffectReady
    Last Transition Time: 2025-05-21T07:23:12Z
    Type: InterfaceInitialized
    Last Transition Time: 2025-05-21T07:23:14Z
    Type: FWConfigured
    Last Transition Time: 2025-05-21T07:23:15Z
    Type: BFBPrepared
    Last Transition Time: 2025-05-21T07:27:23Z
    Type: OSInstalled
    Last Transition Time: 2025-05-21T07:45:01Z
    Type: Rebooted

    Wait until the DPUs reach the Rebooted stage, and then manually power cycle the bare-metal hosts.

    After the DPU is up, run the following command for each DPU worker to remove the external-reboot-required annotation from its DPUNode:

    Jump Node Console


    $ kubectl annotate dpunodes -n dpf-operator-system dpuworker1 provisioning.dpu.nvidia.com/dpunode-external-reboot-required-

    $ kubectl annotate dpunodes -n dpf-operator-system dpuworker2 provisioning.dpu.nvidia.com/dpunode-external-reboot-required-

    At this point, the DPU workers should be added to the cluster. As they are being added to the cluster, the DPUs are provisioned.

    Jump Node Console


    $ watch -n10 "kubectl describe dpu -n dpf-operator-system | grep 'Node Name\|Type\|Last\|Phase'"
    Every 10.0s: kubectl describe dpu -n dpf-operator-system | grep 'Node Name\|Type\|Last\|Phase'    setup5-jump: Wed May 21 10:45:44 2025

    Dpu Node Name: dpuworker1
    Type: InternalIP
    Type: Hostname
    Last Transition Time: 2025-05-21T07:23:09Z
    Type: Initialized
    Last Transition Time: 2025-05-21T07:23:09Z
    Type: BFBReady
    Last Transition Time: 2025-05-21T07:23:11Z
    Type: NodeEffectReady
    Last Transition Time: 2025-05-21T07:23:15Z
    Type: InterfaceInitialized
    Last Transition Time: 2025-05-21T07:23:17Z
    Type: FWConfigured
    Last Transition Time: 2025-05-21T07:23:18Z
    Type: BFBPrepared
    Last Transition Time: 2025-05-21T07:27:25Z
    Type: OSInstalled
    Last Transition Time: 2025-05-21T07:44:54Z
    Type: Rebooted
    Last Transition Time: 2025-05-21T07:44:54Z
    Type: DPUClusterReady
    Last Transition Time: 2025-05-21T07:44:55Z
    Type: Ready
    Phase: Ready

    Dpu Node Name: dpuworker2
    Type: InternalIP
    Type: Hostname
    Last Transition Time: 2025-05-21T07:23:08Z
    Type: Initialized
    Last Transition Time: 2025-05-21T07:23:09Z
    Type: BFBReady
    Last Transition Time: 2025-05-21T07:23:09Z
    Type: NodeEffectReady
    Last Transition Time: 2025-05-21T07:23:12Z
    Type: InterfaceInitialized
    Last Transition Time: 2025-05-21T07:23:14Z
    Type: FWConfigured
    Last Transition Time: 2025-05-21T07:23:15Z
    Type: BFBPrepared
    Last Transition Time: 2025-05-21T07:27:23Z
    Type: OSInstalled
    Last Transition Time: 2025-05-21T07:45:01Z
    Type: Rebooted
    Last Transition Time: 2025-05-21T07:45:01Z
    Type: DPUClusterReady
    Last Transition Time: 2025-05-21T07:45:02Z
    Type: Ready
    Phase: Ready

    Finally, validate that all the different DPU-related objects are now in the Ready state:

    Jump Node Console


    $ kubectl get secrets -n dpu-cplane-tenant1 dpu-cplane-tenant1-admin-kubeconfig -o json | jq -r '.data["admin.conf"]' | base64 --decode > /home/depuser/dpu-cluster.config

    $ KUBECONFIG=/home/depuser/dpu-cluster.config kubectl get node -A
    NAME           STATUS   ROLES    AGE   VERSION
    dpu-device-1   Ready    <none>   94s   v1.30.12
    dpu-device-2   Ready    <none>   84s   v1.30.12

    $ kubectl get dpu -A
    NAMESPACE             NAME           READY   PHASE   AGE
    dpf-operator-system   dpu-device-1   True    Ready   21m
    dpf-operator-system   dpu-device-2   True    Ready   21m

    $ kubectl wait --for=condition=ready --namespace dpf-operator-system dpu --all
    dpu.provisioning.dpu.nvidia.com/dpu-device-1 condition met
    dpu.provisioning.dpu.nvidia.com/dpu-device-2 condition met
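When a deployment has many DPUs, it can help to list only those that are not ready yet. The sketch below filters the tabular output of `kubectl get dpu -A`; the function name `not_ready_dpus` is hypothetical, and the column order (NAMESPACE, NAME, READY, PHASE, AGE) is assumed from the sample output in this guide.

```shell
#!/usr/bin/env bash
# Read `kubectl get dpu -A` output on stdin and print the NAME of every DPU
# whose READY column is not "True" or whose PHASE column is not "Ready".
# The first line is skipped because it is the column header.
not_ready_dpus() {
  awk 'NR > 1 && ($3 != "True" || $4 != "Ready") { print $2 }'
}

# Example against a live cluster:
# pending="$(kubectl get dpu -A | not_ready_dpus)"
# if [ -z "$pending" ]; then echo "all DPUs ready"; else echo "not ready: $pending"; fi
```

An empty result corresponds to the state shown above, where every DPU reports READY as True and PHASE as Ready.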

    Congratulations, the DPF system has been successfully installed!

    Authors


    Boris Kovalev

    Boris Kovalev has worked for the past several years as a Solutions Architect, focusing on NVIDIA Networking/Mellanox technology, and is responsible for complex machine learning, Big Data and advanced VMware-based cloud research and design. Boris previously spent more than 20 years as a senior consultant and solutions architect at multiple companies, most recently at VMware. He has written multiple reference designs covering VMware, machine learning, Kubernetes, and container solutions which are available at the NVIDIA Documents website.

    NVIDIA, the NVIDIA logo, and BlueField are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

    © 2025 NVIDIA Corporation. All rights reserved.

    This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality. NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice. Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete. NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

    Last updated on May 29, 2025.