DPF Setup for NICo Integration

Introduction

NICo supports two ways of provisioning DPUs:

  1. iPXE based
  2. DPF based

This manual covers deployment of DPF based provisioning as it is used by NICo. It assumes that a working Kubernetes cluster is already available, and is intentionally agnostic to the specific cluster implementation (kubeadm, k3s, RKE2, managed clouds, etc.)—any conformant cluster that satisfies the DPF prerequisites is acceptable.

This guide is not a replacement for the official DPF documentation. The authoritative source for installing and configuring DPF is the upstream guide:

NICo is designed to follow the Zero-Trust use case detailed in the DPF documentation: DPF Zero-Trust Mode - HBN Usecase.

You should follow that guide as the base. The instructions below only describe the deltas, additions, and tweaks that need to be applied on top of the official DPF flow so that NICo can integrate with the resulting DPF installation. This manual is based on DPF 26.04; minor adjustments may be necessary on other versions and on environments other than a development setup.

The guide is organized into the following sections:

  1. Prerequisites — work that must be done before installing DPF.
  2. DPF Installation — NICo-relevant notes when installing the DPF operator.
  3. Post-Installation Configuration — the cluster state and NICo configuration that must be in place after DPF is installed and before NICo starts.
  4. Restart carbide-api — what NICo creates on startup, and why a restart is required to apply DPF config changes.

Note: NICo expects DPF to be installed and configured on the same Kubernetes cluster where NICo (the controller) runs.


1. Prerequisites

The official DPF guide lists a set of cluster-level prerequisites (Argo CD, cert-manager, Kamaji etc.). Follow that guide for those components.

NICo reuses several of those same components (notably Argo CD and cert-manager). If they are already installed for NICo, do not reinstall them — only configure the missing pieces and adapt the existing installations so DPF can use them. The subsections below cover the prerequisite configuration that is specific to a NICo + DPF deployment.

1.1. Create the DPF operator namespace

All DPF operator workloads, secrets, ConfigMaps, and CRs live in the dpf-operator-system namespace. Create it idempotently:

$ kubectl get namespace dpf-operator-system &>/dev/null \
    || kubectl create namespace dpf-operator-system

1.2. Image pull and helm repository credentials

Access to the DPF staging Helm chart and related container images requires authentication through NVIDIA NGC. Both the DPF operator and the workloads it deploys will need credentials for pulling Helm charts and container images from private registries. For detailed instructions, see: https://docs.nvidia.com/networking/display/dpf25101/using-private-registries.

1.2.a. hbn-user-password Secret

A randomly generated local password for the HBN (Host-Based Networking) DPUService, which runs FRR on the DPU. The DPF operator picks this Secret up by label.

$ kubectl -n dpf-operator-system create secret generic hbn-user-password \
    --from-literal=password="$(tr -dc 'a-z0-9' < /dev/urandom | head -c 10)" \
    || kubectl get secret hbn-user-password -n dpf-operator-system

$ kubectl -n dpf-operator-system label secret hbn-user-password \
    dpu.nvidia.com/image-pull-secret=""

The dpu.nvidia.com/image-pull-secret="" label is a DPF convention that tells the operator “propagate this Secret into DPUService image-pull secrets.” The label is reused even though this is not strictly an image-pull Secret — DPF’s controllers selector-match on this label to mirror Secrets onto the DPU cluster.

1.2.b. dpf-pull-secret docker-registry Secret

Credentials for nvcr.io, used by the DPF operator and by the operands it deploys to pull staging images.

$ kubectl -n dpf-operator-system create secret docker-registry dpf-pull-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY" \
    || kubectl get secret dpf-pull-secret -n dpf-operator-system

$ kubectl -n dpf-operator-system label secret dpf-pull-secret \
    dpu.nvidia.com/image-pull-secret=""

1.2.c. Secret to pull NICo docker service images

Credentials for nvcr.io, used by the DPF operator to download NICo service images.

$ kubectl -n dpf-operator-system create secret docker-registry nico-pull-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY_WITH_NICO_DOCKER_IMAGE_ACCESS" \
    || kubectl get secret nico-pull-secret -n dpf-operator-system

$ kubectl -n dpf-operator-system label secret nico-pull-secret \
    dpu.nvidia.com/image-pull-secret=""

1.2.d. Argo CD repository Secrets for Helm charts

DPF pulls several Helm charts via Argo CD. Apply the following Secrets so that Argo CD can authenticate to the NGC Helm repositories:

---
apiVersion: v1
kind: Secret
metadata:
  name: ngc-doca-oci-helm
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  name: nvstaging-doca-oci
  url: nvcr.io/nvstaging/doca
  type: helm
  password: $NGC_API_KEY
data:
  # $oauthtoken base64 encoded. This prevents envsubst from substituting the value.
  username: JG9hdXRodG9rZW4=
  # "true" base64 encoded
  enableOCI: dHJ1ZQ==
---
apiVersion: v1
kind: Secret
metadata:
  name: ngc-doca-https-helm
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  name: nvstaging-doca-https
  url: https://helm.ngc.nvidia.com/nvstaging/doca
  type: helm
  password: $NGC_API_KEY
data:
  username: JG9hdXRodG9rZW4=
---
apiVersion: v1
kind: Secret
metadata:
  name: ngc-carbide-https-helm
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  name: nvstaging-carbide-https
  url: https://helm.ngc.nvidia.com/0837451325059433/carbide-dev
  type: helm
  password: $NGC_API_KEY
data:
  username: JG9hdXRodG9rZW4=

Each Secret is labelled argocd.argoproj.io/secret-type: repository, which is how Argo CD discovers Helm repositories.
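
The base64 strings in the data sections above can be reproduced locally, which is a quick way to confirm that a value was not mangled when the manifest was copied. The following is a minimal sketch; b64_literal is a hypothetical helper name, not part of any tool referenced here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: encode a literal string the way a Secret's `data`
# map expects it. printf (unlike echo) does not append a trailing newline,
# so the encoded value contains only the characters passed in.
b64_literal() {
  printf '%s' "$1" | base64
}

# Single quotes stop the shell from expanding $oauthtoken as a variable.
b64_literal '$oauthtoken'   # JG9hdXRodG9rZW4=
b64_literal 'true'          # dHJ1ZQ==
```

The two printed values match the username and enableOCI entries in the manifests.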

Important: the url field must not end with a /, as any difference in the url (including an extra slash) will prevent Argo CD from matching the repository to the correct Secret.
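
Because a stray trailing slash is easy to miss, it can help to normalize the URL before it is written into a Secret. The sketch below assumes the URL is held in a shell variable; normalize_repo_url is a hypothetical helper, not an Argo CD or DPF command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip every trailing "/" from a repository URL so the
# value stored in the Secret matches the URL that charts are requested from.
normalize_repo_url() {
  local url="$1"
  # ${url%/} removes one trailing slash; loop until none are left.
  while [ "${url%/}" != "$url" ]; do
    url="${url%/}"
  done
  printf '%s\n' "$url"
}

normalize_repo_url 'https://helm.ngc.nvidia.com/nvstaging/doca/'
```

A URL that already has no trailing slash is returned unchanged.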

| Secret name | Repo URL | Type | Used by |
| --- | --- | --- | --- |
| ngc-doca-oci-helm | nvcr.io/nvstaging/doca | OCI helm | DPF operator chart pulls |
| ngc-doca-https-helm | https://helm.ngc.nvidia.com/nvstaging/doca | HTTPS helm | Some DPUService charts |
| ngc-carbide-https-helm | https://helm.ngc.nvidia.com/0837451325059433/carbide-dev | HTTPS helm | Carbide-private DPUService charts |

1.3. Cert-manager policy and RBAC for DPF

DPF relies on cert-manager to mint short-lived certificates. If the cluster runs approver-policy (CRD policy.cert-manager.io/CertificateRequestPolicy), no CSR will be approved unless a matching policy whitelists it, and DPF’s CSRs will hang in Pending indefinitely.

Two objects must therefore be installed:

  1. A CertificateRequestPolicy that is permissive for the dpf-operator-system namespace.
  2. A ClusterRole + ClusterRoleBinding granting cert-manager itself the use verb on that policy.

Note: The policy and role below use wildcard (*) values for convenience. In production, the exact set of allowed names, SANs, and usages should be tightened with help from the DPF team.

policy.yaml

---
apiVersion: policy.cert-manager.io/v1alpha1
kind: CertificateRequestPolicy
metadata:
  labels:
    argocd.argoproj.io/instance: dpf-pki-policies
  name: dpf-approval-policy
spec:
  selector:
    namespace:
      matchNames: [dpf-operator-system]
    issuerRef:
      name: '*'
      kind: '*'
      group: '*'
  allowed:
    commonName:
      value: '*'
    dnsNames:
      values: ['*']
    ipAddresses:
      values: ['*']
    uris:
      values: ['*']
    emailAddresses:
      values: ['*']
    isCA: true
    usages:
      - server auth
      - client auth
      - digital signature
      - key encipherment

This allows any CertificateRequest in the dpf-operator-system namespace, against any issuer, with any SAN (DNS / IP / URI / email), CA or not, with the listed usages.

rbac-role.yaml

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cert-manager-policy:dpf-approval-policy
rules:
  - apiGroups: [policy.cert-manager.io]
    resources: [certificaterequestpolicies]
    verbs: [use]
    resourceNames: [dpf-approval-policy]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cert-manager-policy:dpf-approval-policy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cert-manager-policy:dpf-approval-policy
subjects:
  - kind: ServiceAccount
    name: cert-manager
    namespace: cert-manager

Without this binding, cert-manager’s controller cannot reference the policy, and all DPF CSRs will hang in Pending.
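
One way to confirm the binding took effect is to ask the API server directly, impersonating the cert-manager ServiceAccount named in the ClusterRoleBinding. This is a sketch rather than an official DPF check; check_policy_binding is a hypothetical wrapper, and the caller needs permission to impersonate ServiceAccounts.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around `kubectl auth can-i`.
# Asks whether cert-manager's ServiceAccount may "use" dpf-approval-policy.
# kubectl prints "yes" or "no" and exits non-zero when the answer is "no".
check_policy_binding() {
  kubectl auth can-i use \
    certificaterequestpolicies.policy.cert-manager.io/dpf-approval-policy \
    --as=system:serviceaccount:cert-manager:cert-manager
}
```

Running check_policy_binding after applying policy.yaml and rbac-role.yaml should print yes; a no suggests the ClusterRole or ClusterRoleBinding is missing or misnamed.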


2. DPF Installation

Follow the upstream DPF installation guide for the actual install procedure:

When installing the DPF operator chart, two parameter overrides are required for a NICo-integrated deployment. The example command below illustrates how to set them:

$ REGISTRY="oci://path/to/doca"
$ TAG="v26.4.0-rc.3"
$ helm upgrade --install -n dpf-operator-system \
    --set "enableNodeFeatureRules=false" \
    --set "imagePullSecrets[0].name=dpf-pull-secret" \
    dpf-operator "$REGISTRY/dpf-operator" --version="$TAG"

NICo-specific notes on the parameters:

  • enableNodeFeatureRules=false — the chart’s bundled NodeFeatureRule resources are disabled because nodes are labeled via NFD’s own configuration (relying on PCI class 0200).
  • imagePullSecrets[0].name=dpf-pull-secret — ties the operator’s pods to the pull Secret created in step 1.2.b so that staging images can be pulled.

Adjust REGISTRY and TAG to the version of DPF you are deploying.


3. Post-Installation Configuration (before NICo starts)

Once the DPF operator is running, the following objects must be applied before NICo is started. They configure the DPF operator for NICo’s provisioning model and grant the orchestrator the access it needs.

3.1. RBAC for the NICo orchestrator

The NICo orchestrator (the carbide-api ServiceAccount in NICo’s default deployment) needs to read and write DPF resources in the dpf-operator-system namespace. The manifest below grants that access through a namespaced Role and RoleBinding, limited to the DPF custom resources and Secrets that carbide-api manages:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nico-api-dpf
  namespace: dpf-operator-system
rules:
  - apiGroups: ["provisioning.dpu.nvidia.com"]
    resources: ["bfbs"]
    verbs: ["get", "list", "create", "patch", "delete"]
  - apiGroups: ["provisioning.dpu.nvidia.com"]
    resources: ["dpus"]
    verbs: ["get", "list", "watch", "patch", "delete"]
  - apiGroups: ["provisioning.dpu.nvidia.com"]
    resources: ["dpudevices"]
    verbs: ["get", "list", "create", "patch", "delete"]
  - apiGroups: ["provisioning.dpu.nvidia.com"]
    resources: ["dpunodes"]
    verbs: ["get", "list", "create", "patch", "delete"]
  - apiGroups: ["provisioning.dpu.nvidia.com"]
    resources: ["dpunodemaintenances"]
    verbs: ["get", "patch"]
  - apiGroups: ["provisioning.dpu.nvidia.com"]
    resources: ["dpuflavors"]
    verbs: ["get", "create"]
  - apiGroups: ["provisioning.dpu.nvidia.com"]
    resources: ["dpusets"]
    verbs: ["get"]
  - apiGroups: ["provisioning.dpu.nvidia.com"]
    resources: ["dpuclusters"]
    verbs: ["get", "list"]
  - apiGroups: ["svc.dpu.nvidia.com"]
    resources: ["dpudeployments"]
    verbs: ["get", "list", "create", "patch", "delete"]
  - apiGroups: ["svc.dpu.nvidia.com"]
    resources: ["dpuservices", "dpuservicechains"]
    verbs: ["get", "list", "create", "patch", "delete"]
  - apiGroups: ["svc.dpu.nvidia.com"]
    resources: ["dpuserviceinterfaces", "dpuservicetemplates", "dpuserviceconfigurations", "dpuservicenads"]
    verbs: ["get", "list", "create", "patch", "delete"]
  - apiGroups: ["operator.dpu.nvidia.com"]
    resources: ["dpfoperatorconfigs"]
    verbs: ["get", "patch"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nico-api-dpf
  namespace: dpf-operator-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: nico-api-dpf
subjects:
  - kind: ServiceAccount
    name: carbide-api
    namespace: forge-system
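
After applying the manifest, a few of the granted permissions can be spot-checked by impersonating the carbide-api ServiceAccount. The sketch below is illustrative only: check_nico_rbac is a hypothetical helper, the verb and resource pairs are a sample taken from the Role above rather than the full set, and the caller must be allowed to impersonate ServiceAccounts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify a sample of the nico-api-dpf Role's permissions.
# Each heredoc line is "<verb> <resource>" and is checked in dpf-operator-system
# while impersonating the carbide-api ServiceAccount from forge-system.
check_nico_rbac() {
  local sa="system:serviceaccount:forge-system:carbide-api"
  local verb resource failed=0
  while read -r verb resource; do
    # </dev/null keeps kubectl from consuming the heredoc the loop reads.
    if ! kubectl auth can-i "$verb" "$resource" \
        -n dpf-operator-system --as="$sa" </dev/null >/dev/null; then
      echo "missing: $verb $resource" >&2
      failed=1
    fi
  done <<'EOF'
create bfbs.provisioning.dpu.nvidia.com
patch dpus.provisioning.dpu.nvidia.com
create dpudeployments.svc.dpu.nvidia.com
patch dpfoperatorconfigs.operator.dpu.nvidia.com
create secrets
EOF
  return "$failed"
}
```

The function returns non-zero and names each missing permission on stderr, so it can gate a deployment script before carbide-api is started.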

3.2. DPFOperatorConfig

This is the operator-level CR that tells DPF how to behave in a NICo environment. For more information about the available fields and their details, refer to the official DPF guide.

---
apiVersion: operator.dpu.nvidia.com/v1alpha1
kind: DPFOperatorConfig
metadata:
  name: dpfoperatorconfig
  namespace: dpf-operator-system
spec:
  dpuDetector:
    disable: true
  provisioningController:
    osInstallTimeout: "60m"
    installInterface:
      installViaRedfish:
        skipDPUNodeDiscovery: true
  overrides:
    # Replace with the IP of the KubeAPI server where DPF control plane is running
    kubernetesAPIServerVIP: "REPLACE_ME"
    # Replace with the port of the KubeAPI server where DPF control plane is running
    # don't quote "" as it should be integer
    kubernetesAPIServerPort: REPLACE_ME
    argoCDNamespace: argocd
  kamajiClusterManager:
    disable: false
  networking:
    highSpeedMTU: 9000
  imagePullSecrets:
    - dpf-pull-secret

Field-by-field:

| Field | Meaning |
| --- | --- |
| dpuDetector.disable: true | DPF normally polls hosts to discover new DPUs. NICo disables auto-discovery because DPUs are fed in via DPUSet CRs from the orchestrator. |
| provisioningController.osInstallTimeout: "60m" | Total budget for the OS install flow per DPU. |
| provisioningController.installViaRedfish | Provision DPUs by talking Redfish to the host BMC (vs. PXE-based). |
| skipDPUNodeDiscovery: true | Do not auto-detect DPUs as Kubernetes nodes; DPF is told about them explicitly by NICo. |
| overrides.kubernetesAPIServerVIP | Replace REPLACE_ME with the host-cluster API-server VIP that DPUs should reach. |
| overrides.kubernetesAPIServerPort | Host-cluster API-server port (6443 by default). |
| overrides.argoCDNamespace | Namespace where Argo CD is installed. |
| kamajiClusterManager.disable: false | Use Kamaji as the DPU control plane. |
| networking.highSpeedMTU: 9000 | Jumbo frames on the high-speed fabric. |
| imagePullSecrets: dpf-pull-secret | Pull Secret inserted into every DPUService spawned by the operator. |
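
Since kubernetesAPIServerPort must be an unquoted integer, a small pre-flight check can catch a quoted or empty value before the manifest is applied. This is a sketch under the assumption that the port is kept in a shell variable while the manifest is being prepared; is_valid_port is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only for a bare decimal integer in the valid
# TCP port range (1-65535). Quotes, spaces, or an empty value are rejected.
is_valid_port() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}

is_valid_port 6443 && echo "6443 is usable as kubernetesAPIServerPort"
```

A value such as "6443" with literal quote characters fails the check, which is the mistake the YAML comment warns about.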

3.3. DPUCluster

The DPUCluster CR defines the Kubernetes control plane that DPU nodes will join. The interface and vip fields must be customized for the environment. For more information about the available fields and their details, refer to the official DPF guide.

---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUCluster
metadata:
  name: carbide-dpf-cluster
  namespace: dpf-operator-system
spec:
  type: kamaji
  maxNodes: 1000
  clusterEndpoint:
    keepalived:
      # Controller interface where the Kamaji cluster IP is configured
      interface: "REPLACE_ME"
      # External IP used by the Kamaji cluster that needs to be accessible from the DPUs
      vip: "REPLACE_ME"
      virtualRouterID: 126
      nodeSelector:
        # Confirm this with node. Some env can have this as 'true' also.
        # kubectl get node <node-name> -o jsonpath='{.metadata.labels.node-role\.kubernetes\.io/control-plane}'
        node-role.kubernetes.io/control-plane: ""

Field-by-field:

| Field | Meaning |
| --- | --- |
| type: kamaji | Use the Kamaji cluster manager; the DPU control plane runs as a Kamaji TenantControlPlane in the host cluster. |
| maxNodes: 1000 | Hard cap on DPU nodes that can join. |
| clusterEndpoint.keepalived.interface | Host network interface on which keepalived advertises the VIP. |
| clusterEndpoint.keepalived.vip | Floating IP that DPU nodes use to reach their control plane. |
| clusterEndpoint.keepalived.virtualRouterID: 126 | VRRP ID; must be unique per host if multiple keepalived instances run there. |
| nodeSelector | Schedule keepalived only on control-plane nodes. |

3.4. VIP LoadBalancer Service and Endpoints

This step exposes the Kamaji cluster IP so it is routable from the DPUs. It may not be required in environments where routing to the VIP is already in place; in that case skip it.

The Service uses a fixed loadBalancerIP matching the VIP set in the DPUCluster above. Replace the loadBalancerIP value before applying.

Note: This manifest applies only to MetalLB-managed deployments.

apiVersion: v1
kind: Service
metadata:
  name: dpu-cluster-vip-loadbalancer
  namespace: dpf-operator-system
  annotations:
    metallb.io/address-pool: 'REPLACE_ME'
spec:
  allocateLoadBalancerNodePorts: true
  loadBalancerIP: "External IP used by the Kamaji cluster"
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  type: LoadBalancer
---
apiVersion: v1
kind: Endpoints
metadata:
  name: dpu-cluster-vip-loadbalancer
  namespace: dpf-operator-system
subsets:
- addresses:
  - ip: 192.0.2.10 # dummy/test IP (RFC 5737 range)
  ports:
  - port: 80

What this does and why it looks unusual:

  • The Service is type LoadBalancer with a fixed loadBalancerIP (the same VIP used by the DPUCluster keepalived). Replace the REPLACE_ME value of the metallb.io/address-pool annotation with the name of the MetalLB address pool that contains the VIP; the annotation tells MetalLB which pool to allocate the address from.
  • A manually created Endpoints object with a single dummy RFC 5737 IP (192.0.2.10) is created with the same name as the Service. This is a Kubernetes idiom: when a Service has no selector, Kubernetes does not manage its Endpoints, and the service proxy (kube-proxy) uses a manually created Endpoints object of the same name verbatim. Putting a dummy IP here means: “reserve the VIP via MetalLB, but route nothing; keepalived is the actual front-end.”
  • Net effect: MetalLB advertises the VIP to the network so external machines (DPUs, BMCs) can reach it, while keepalived handles the actual TCP termination.
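
To confirm that MetalLB actually handed out the intended address, the external IP reported on the Service can be compared with the VIP from the DPUCluster. The following is a sketch, not a DPF-provided check; vip_is_assigned is a hypothetical helper, and the address in the usage note is a documentation-range example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when the LoadBalancer Service reports the
# expected VIP as its external IP.
vip_is_assigned() {
  local expected_vip="$1" actual
  # Empty output means no address has been assigned to the Service yet.
  actual="$(kubectl -n dpf-operator-system get service dpu-cluster-vip-loadbalancer \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
  [ "$actual" = "$expected_vip" ]
}
```

For example, vip_is_assigned 203.0.113.7 || echo "VIP not assigned yet" reports when the Service is still pending or received a different address.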

If your environment uses a different LoadBalancer mechanism (kube-vip, a cloud-provider LB, etc.), use it to expose the VIP and point the DPUCluster’s keepalived.vip at the same address.

3.5. Enable DPF in the NICo site config

DPF integration is gated on a site-level switch in the carbide-api TOML config (the file mounted into the carbide-api deployment, typically via the carbide-api-site-config-files ConfigMap). Add a [dpf] section and set enabled = true:

[dpf]
enabled = true
docker_image_pull_secret = "nico-pull-secret"

docker_image_pull_secret is an optional parameter that specifies the name of the Kubernetes Secret used to pull service container images for NICo services. If this field is omitted, NICo defaults to using the dpf-pull-secret for image pulls. In this scenario, ensure that the dpf-pull-secret is configured with a legacy NGC API key for better compatibility.

[dpf].services.* sub-tables can additionally override the Helm chart and container image of each mandatory DPUService that carbide-api deploys (dts, doca_hbn, dpu_agent, dhcp_server, fmds, otel). All of these have built-in defaults; override them only when pinning to a non-default version or registry. Each entry has the same shape:

[dpf.services.<service>]
name = "<logical service name>"
helm_repo_url = "<helm repository URL>"
helm_chart = "<helm chart name>"
helm_version = "<helm chart version>" # empty → CI default
docker_repo_url = "<image registry+repo>"
docker_image_tag = "<image tag>" # empty → CI default
docker_image_pull_secret = "dpf-pull-secret"

Field reference (all under [dpf]):

| TOML key | Type | Default | Meaning |
| --- | --- | --- | --- |
| enabled | bool | false | Master switch. Must be true to use DPF-based provisioning. |
| services.<svc> | table | per-service defaults | Helm/image overrides for each mandatory DPUService. |

Notes:

  • The DPF operator namespace (dpf-operator-system) and the kubeconfig used to talk to the host cluster are not configured here — carbide-api uses its in-cluster ServiceAccount and the fixed dpf-operator-system namespace.

3.6. Mark hosts as DPF-managed in expected machines

Whether a given host is provisioned via DPF or via iPXE is decided per host, in the expected machines list that NICo loads on startup. The relevant field is is_dpf_enabled on each expected-machine entry. A host is provisioned via DPF only when both of the following are true:

  1. [dpf].enabled = true in the site config (section 3.5), and
  2. is_dpf_enabled = true on that host’s expected-machine entry.
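
The two conditions combine as a simple logical AND. The sketch below only illustrates that rule; provisioning_mode is a hypothetical function for reasoning about the outcome and does not reflect how carbide-api implements the decision.

```shell
#!/usr/bin/env bash
# Illustration only: derive the provisioning path for one host.
# $1 = site-level [dpf].enabled, $2 = the host's per-entry DPF flag;
# both are the strings "true" or "false".
provisioning_mode() {
  if [ "$1" = "true" ] && [ "$2" = "true" ]; then
    echo "dpf"
  else
    echo "ipxe"
  fi
}

provisioning_mode true true    # dpf
provisioning_mode true false   # ipxe
provisioning_mode false true   # ipxe
```

In particular, setting the per-host flag has no effect while the site-level switch is off.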

There are several operator paths that can set this field. They are described below in the order an operator typically uses them.

3.6.a. carbide-admin-cli expected-machine add — create a new entry

Adds a new expected-machine row. --dpf-enabled is optional; omitting it stores false.

$ carbide-admin-cli expected-machine add \
    --bmc-mac-address 1a:1b:1c:1d:1e:1f \
    --bmc-username admin \
    --bmc-password secret \
    --chassis-serial-number CHASSIS-SN-001 \
    --dpf-enabled true

3.6.b. carbide-admin-cli expected-machine patch — partial update via flags

Updates an existing entry in place. The lookup key is --bmc-mac-address (or --id <UUID>). Omitting --dpf-enabled preserves the existing value.

$ carbide-admin-cli expected-machine patch \
    --bmc-mac-address 1a:1b:1c:1d:1e:1f \
    --chassis-serial-number CHASSIS-SN-001 \
    --dpf-enabled true

3.6.c. carbide-admin-cli expected-machine update --filename — single-host update from JSON

Updates one entry from a JSON file. The JSON shape uses chassis_serial_number (not serial_number) and any field omitted from the file is preserved server-side.

em.json:

{
  "bmc_mac_address": "1a:1b:1c:1d:1e:1f",
  "bmc_username": "admin",
  "bmc_password": "secret",
  "chassis_serial_number": "CHASSIS-SN-001",
  "dpf_enabled": true
}

$ carbide-admin-cli expected-machine update --filename em.json

This is the most ergonomic path for “toggle DPF on one already-existing expected machine without touching anything else.”

3.6.d. carbide-admin-cli expected-machine replace-all --filename — destructive full reload

Wipes the entire expected_machines table and re-creates it from the file. The file shape is a wrapper object whose expected_machines array uses the same per-entry shape as update:

em-all.json:

{
  "expected_machines": [
    {
      "bmc_mac_address": "1a:1b:1c:1d:1e:1f",
      "bmc_username": "admin",
      "bmc_password": "secret",
      "chassis_serial_number": "CHASSIS-SN-001",
      "dpf_enabled": true
    }
  ]
}

$ carbide-admin-cli expected-machine replace-all --filename em-all.json

Important: this is not a merge. Any expected-machine row that is not present in the file is deleted. Each entry is then re-created via the same path as add, so any entry whose dpf_enabled is omitted is re-inserted with dpf_enabled = false.

3.6.e. Quick reference

| Goal | Path |
| --- | --- |
| Add a new host with DPF enabled | carbide-admin-cli expected-machine add … --dpf-enabled true |
| Flip DPF on an existing entry, preserving everything else | carbide-admin-cli expected-machine update --filename em.json |
| Flip DPF inline with one or more other fields | carbide-admin-cli expected-machine patch … --dpf-enabled true |
| Replace the entire inventory | carbide-admin-cli expected-machine replace-all --filename em-all.json |
| Inspect current value | carbide-admin-cli expected-machine show <bmc-mac> |

3.7. Enabling DPF for Existing (Ingested) Nodes

You can enable the DPF flag on an already discovered host without force-deleting or recreating it by using:

$ carbide-admin-cli dpf enable <host-id>

After changing the DPF status for a host in this way, trigger reprovisioning for every DPU under that host (using its host ID). This matters most on hosts with multiple DPUs: if any DPU is left out, NICo will not transition the node to DPF-managed status.

Note: The carbide-admin-cli dpf enable command updates the DPF flag only for the currently ingested machine. If you later force-delete the host, this change is lost—on rediscovery, the DPF setting will revert to whatever is present in your expected_machines database.


4. Restart carbide-api to create the DPF initialization objects

Once everything in sections 1–3 is in place, carbide-api must be (re)started. DPF initialization in carbide-api is startup-only: the [dpf] config is read once when the process comes up, and that is the only point at which the DPF initialization objects are created in the host cluster.

On startup with [dpf].enabled = true, carbide-api creates the following objects in the dpf-operator-system namespace:

  • a Secret (bmc-shared-password) holding the shared BMC password,
  • a BFB CR named bf-bundle-<sha256([dpf].bfb_url)>,
  • a DPUFlavor CR named after [dpf].flavor_name,
  • a set of DPUServiceInterface, DPUServiceTemplate, DPUServiceConfiguration, and DPUServiceNAD CRs — one per mandatory DPUService (dts, doca-hbn, carbide-dpu-agent, carbide-dhcp-server, carbide-fmds, carbide-otelcol),
  • a DPUDeployment CR named after [dpf].deployment_name, which references the BFB, the DPUFlavor, and the service templates above, and which the DPF operator then reconciles into actual DPUService and per-DPU resources, and spec.provisioningController.bfCFGTemplateConfigMap).
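
When looking for the BFB CR in the cluster, it can help to predict its name from the configured URL. The sketch below assumes the suffix is the full lowercase hex SHA-256 digest of the exact [dpf].bfb_url string; the source only gives the pattern bf-bundle-<sha256(...)>, so the encoding and any truncation are assumptions, and bfb_cr_name is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the expected BFB CR name from a BFB URL.
# ASSUMPTION: the suffix is the full lowercase hex SHA-256 of the URL string,
# hashed without a trailing newline. Verify against a real cluster.
bfb_cr_name() {
  local digest
  digest="$(printf '%s' "$1" | sha256sum | cut -d' ' -f1)"
  printf 'bf-bundle-%s\n' "$digest"
}
```

If the assumption holds, kubectl -n dpf-operator-system get bfb "$(bfb_cr_name "$BFB_URL")" would fetch the object created at startup, where BFB_URL is a placeholder variable holding the configured URL. If no such object exists, list the BFB CRs instead and compare names by hand.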

Because this path runs only at process start, any change to [dpf] requires a carbide-api restart for the new configuration to take effect. This includes enabling DPF for the first time, changing the BFB URL, renaming the DPUDeployment or DPUFlavor, and pinning a different chart or image version under [dpf.services.*].
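
A restart can be triggered with a standard rollout, sketched below. This assumes carbide-api runs as a Deployment named carbide-api in the forge-system namespace (the namespace of the ServiceAccount in section 3.1); both names are assumptions about the deployment layout, and restart_carbide_api is a hypothetical wrapper.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: restart carbide-api and wait for the rollout.
# ASSUMPTIONS: the workload is a Deployment called "carbide-api"; the
# namespace defaults to forge-system and can be overridden via $1.
restart_carbide_api() {
  local ns="${1:-forge-system}"
  kubectl -n "$ns" rollout restart deployment/carbide-api
  # Block until the new pods are ready, or give up after five minutes.
  kubectl -n "$ns" rollout status deployment/carbide-api --timeout=5m
}
```

Waiting on the rollout status makes it easier to tell a failed start (for example, a malformed [dpf] section) from a slow one.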


Appendix: carbide-admin-cli dpf command reference

carbide-admin-cli ships a top-level dpf subcommand group for inspecting and toggling DPF state on already-ingested hosts and for diffing the running DPF service stack against the configured one. The full set is listed below.

Important: All dpf enable changes are written to the machine’s metadata only. This is useful when you want to reprovision, through DPF, a host that was not previously managed by DPF. The change is wiped on force-delete, and on rediscovery the host reverts to whatever its expected-machine entry says. To persist the per-host DPF setting, update the expected-machines table (see section 3.6).

dpf enable — turn DPF on for a host

$ carbide-admin-cli dpf enable <host-machine-id>

| Argument | Required | Notes |
| --- | --- | --- |
| <host-machine-id> | yes | Must be a host machine id; DPU ids are rejected. |

Sets machines.dpf.enabled = true on the given host’s runtime row by calling the ModifyDPFState RPC.

dpf show — inspect DPF state for one or all hosts

# One host
$ carbide-admin-cli dpf show <host-machine-id>

# All hosts (paginated by --page-size)
$ carbide-admin-cli dpf show

| Argument | Required | Notes |
| --- | --- | --- |
| <host-machine-id> | no | If omitted, lists DPF state for every host. DPU ids are rejected. |

Output for a single host prints Enabled and Used For Ingestion flags; the multi-host form prints a table with one row per host. DPUs are excluded from the all-hosts list.

dpf snapshot — dump DPF CRs for a host

$ carbide-admin-cli dpf snapshot <host-machine-id>

| Argument | Required | Notes |
| --- | --- | --- |
| <host-machine-id> | yes | Must be a host machine id; DPU ids are rejected. |

Calls the GetDpfHostSnapshot RPC and prints the DPUNode, DPUDevice, and DPU CRs that DPF currently has for the given host. Useful for diagnosing why a host is stuck during DPF-based provisioning.

dpf service-version (alias: sv) — diff configured vs. deployed services

$ carbide-admin-cli dpf service-version
# or
$ carbide-admin-cli dpf sv

No arguments. Prints a table comparing each configured DPF service (taken from [dpf.services.*] in the site config when present, otherwise from the compile-time defaults) against what is actually deployed in the cluster:

| Column | Meaning |
| --- | --- |
| Service | Logical service name (dts, doca-hbn, …). |
| Config Helm Version | Helm chart version used by NICo. |
| Live Helm Version | Helm chart version currently deployed; suffixed with (match) or (DIFFERS), or n/a if not deployed. |
| Config Docker Tag | Image tag used by NICo (- if unset). |
| Live Docker Tag | Image tag currently deployed; suffixed with (match) or (DIFFERS), or n/a if not deployed. |

A DIFFERS row indicates the running stack does not match the carbide-api config and that a carbide-api restart (section 4) is needed to reconcile the configured versions onto the cluster.
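
In a script, the same check can be automated by filtering the command output for the (DIFFERS) suffix described above. This is a minimal sketch that relies only on that suffix appearing literally in the affected rows; rows_that_differ is a hypothetical helper and makes no other assumption about the table layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `carbide-admin-cli dpf service-version` output on
# stdin and print every row carrying the (DIFFERS) suffix.
# Exit status: 0 if at least one row differs, 1 if none do.
rows_that_differ() {
  grep -F '(DIFFERS)'
}
```

For example, carbide-admin-cli dpf sv | rows_that_differ && echo "carbide-api restart needed" prints the drifting services followed by the reminder, and prints nothing when everything matches.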

Quick reference

| Goal | Command |
| --- | --- |
| Turn DPF on for an already-discovered host (transient) | carbide-admin-cli dpf enable <host-id> |
| Show DPF state for one host | carbide-admin-cli dpf show <host-id> |
| List DPF state for all hosts | carbide-admin-cli dpf show |
| Snapshot DPF CRs for a host | carbide-admin-cli dpf snapshot <host-id> |
| Diff configured vs. deployed DPF service versions | carbide-admin-cli dpf service-version |