DPF Setup for NICo Integration
Introduction
NICo supports two ways of provisioning DPUs:
- iPXE-based
- DPF-based
This manual covers deployment of DPF-based provisioning as it is used by NICo. It assumes that a working Kubernetes cluster is already available and is intentionally agnostic to the specific cluster implementation (kubeadm, k3s, RKE2, managed clouds, etc.); any conformant cluster that satisfies the DPF prerequisites is acceptable.
This guide is not a replacement for the official DPF documentation. The authoritative source for installing and configuring DPF is the upstream guide:
NICo is designed to follow the Zero-Trust use case detailed in the DPF documentation: DPF Zero-Trust Mode - HBN Usecase.
You should follow that guide as the base. The instructions below only describe the deltas, additions, and tweaks that need to be applied on top of the official DPF flow so that NICo can integrate with the resulting DPF installation. This manual is based on DPF 26.04; minor adjustments may be necessary on other versions and on environments other than a development setup.
The guide is organized into the following sections:
- Prerequisites — work that must be done before installing DPF.
- DPF Installation — NICo-relevant notes when installing the DPF operator.
- Post-Installation Configuration — the cluster state and NICo configuration that must be in place after DPF is installed and before NICo starts.
- Restart carbide-api — what NICo creates on startup, and why a restart is required to apply DPF config changes.
Note: NICo expects DPF to be installed and configured on the same Kubernetes cluster where NICo (the controller) runs.
1. Prerequisites
The official DPF guide lists a set of cluster-level prerequisites (Argo CD, cert-manager, Kamaji, etc.). Follow that guide for those components.
NICo reuses several of those same components (notably Argo CD and cert-manager). If they are already installed for NICo, do not reinstall them — only configure the missing pieces and adapt the existing installations so DPF can use them. The subsections below cover the prerequisite configuration that is specific to a NICo + DPF deployment.
1.1. Create the DPF operator namespace
All DPF operator workloads, secrets, ConfigMaps, and CRs live in the
dpf-operator-system namespace. Create it idempotently:
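One way to keep the creation idempotent is to apply a manifest instead of running kubectl create, which fails if the namespace already exists. The following is a minimal sketch; the helper name render_dpf_namespace is illustrative and not part of NICo or DPF.

```shell
# Print the Namespace manifest. Only the namespace name comes from this
# guide; the helper itself is a hypothetical convenience.
render_dpf_namespace() {
  cat <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: dpf-operator-system
EOF
}

render_dpf_namespace
# To apply it (safe to re-run when the namespace already exists):
#   render_dpf_namespace | kubectl apply -f -
```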
1.2. Image pull and Helm repository credentials
Access to the DPF staging Helm chart and related container images requires authentication through NVIDIA NGC. Both the DPF operator and the workloads it deploys will need credentials for pulling Helm charts and container images from private registries. For detailed instructions, see: https://docs.nvidia.com/networking/display/dpf25101/using-private-registries.
1.2.a. hbn-user-password Secret
A random local credential pair used by the HBN (Host-Based Networking) DPUService, which runs FRR on the DPU. The DPF operator picks this Secret up by label.
The dpu.nvidia.com/image-pull-secret="" label is a DPF convention that tells
the operator “propagate this Secret into DPUService image-pull secrets.” The
label is reused even though this is not strictly an image-pull Secret — DPF’s
controllers selector-match on this label to mirror Secrets onto the DPU
cluster.
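As a rough sketch of what such a Secret might look like: the Secret name, namespace, and label below come from this section, but the data keys (username, password), the default user name, and the HBN_USER / HBN_PASSWORD variables are assumptions made for illustration. Check the HBN DPUService documentation for the keys it actually expects.

```shell
# Sketch only. The keys "username"/"password" and the default user "hbn"
# are assumptions, not confirmed by this guide.
render_hbn_secret() {
  hbn_user="${HBN_USER:-hbn}"
  # Random local password unless HBN_PASSWORD is supplied.
  hbn_pass="${HBN_PASSWORD:-$(head -c 24 /dev/urandom | base64 | tr -d '\n')}"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: hbn-user-password
  namespace: dpf-operator-system
  labels:
    dpu.nvidia.com/image-pull-secret: ""
type: Opaque
stringData:
  username: "${hbn_user}"
  password: "${hbn_pass}"
EOF
}

render_hbn_secret
# To apply: render_hbn_secret | kubectl apply -f -
```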
1.2.b. dpf-pull-secret docker-registry Secret
Credentials for nvcr.io, used by the DPF operator and by the operands it
deploys to pull staging images.
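A docker-registry Secret is a kubernetes.io/dockerconfigjson Secret. The sketch below renders one for nvcr.io by hand, which is roughly what kubectl create secret docker-registry produces. The NGC_API_KEY environment variable and the helper name are assumptions; '$oauthtoken' is the literal user name NGC uses for API-key logins.

```shell
# Sketch: render a dockerconfigjson Secret for nvcr.io.
# Usage: NGC_API_KEY=<key> render_pull_secret <secret-name>
render_pull_secret() {
  name="$1"
  key="${NGC_API_KEY:?set NGC_API_KEY first}"
  # "auth" is base64("<user>:<password>"), as in a Docker config file.
  auth=$(printf '%s:%s' '$oauthtoken' "$key" | base64 | tr -d '\n')
  cfg=$(printf '{"auths":{"nvcr.io":{"username":"$oauthtoken","password":"%s","auth":"%s"}}}' "$key" "$auth")
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: ${name}
  namespace: dpf-operator-system
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: $(printf '%s' "$cfg" | base64 | tr -d '\n')
EOF
}

# To apply:
#   NGC_API_KEY=<key> render_pull_secret dpf-pull-secret | kubectl apply -f -
```

The same helper could be reused with a different Secret name for other nvcr.io pull Secrets.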
1.2.c. Secret to pull NICo docker service images
Credentials for nvcr.io, used by the DPF operator to download NICo
service images.
1.2.d. Argo CD repository Secrets for Helm charts
DPF pulls several Helm charts via Argo CD. Apply the following Secrets so that Argo CD can authenticate to the NGC Helm repositories:
Each Secret is labelled argocd.argoproj.io/secret-type: repository, which is
how Argo CD discovers Helm repositories.
Important: the url field must not end with a /, as any difference in the url (including an extra slash) will prevent Argo CD from matching the repository to the correct Secret.
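A minimal sketch of one such repository Secret follows. The Secret name, the ARGOCD_NAMESPACE and NGC_API_KEY variables, and the example URL are placeholders rather than values from this guide; the helper also strips a single trailing slash from the URL to guard against the mismatch described above.

```shell
# Sketch: render an Argo CD repository Secret for one Helm repository.
# Usage: render_argocd_helm_repo <secret-name> <repo-url>
render_argocd_helm_repo() {
  name="$1"
  url="${2%/}"   # drop one trailing "/" so the URL matches exactly
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: ${name}
  namespace: ${ARGOCD_NAMESPACE:-argocd}
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  name: ${name}
  type: helm
  url: ${url}
  username: "\$oauthtoken"
  password: "${NGC_API_KEY:-REPLACE_ME}"
EOF
}

# Placeholder repository URL; substitute the NGC Helm repository you use.
render_argocd_helm_repo ngc-helm-repo "https://helm.ngc.nvidia.com/REPLACE_ORG/REPLACE_TEAM/"
# To apply: render_argocd_helm_repo <name> <url> | kubectl apply -f -
```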
1.3. Cert-manager policy and RBAC for DPF
DPF relies on cert-manager to mint short-lived certificates. If the cluster
runs approver-policy (CRD policy.cert-manager.io/CertificateRequestPolicy),
no CSR will be approved unless a matching policy whitelists it, and DPF’s
CSRs will hang in Pending indefinitely.
Two objects must therefore be installed:
- A CertificateRequestPolicy that is permissive for the dpf-operator-system namespace.
- A ClusterRole + ClusterRoleBinding granting cert-manager itself the use verb on that policy.
Note: The policy and role below use wildcard (*) values for convenience. In production, the exact set of allowed names, SANs, and usages should be tightened with help from the DPF team.
policy.yaml
This allows any CertificateRequest in the dpf-operator-system namespace,
against any issuer, with any SAN (DNS / IP / URI / email), CA or not, with the
listed usages.
rbac-role.yaml
Without this binding, cert-manager’s controller cannot reference the policy and all DPF CSRs will hang in Pending.
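The two objects described in this section might look roughly like the sketch below. The object names, the list of usages, and the cert-manager ServiceAccount name and namespace (those of a default cert-manager install) are assumptions; only the namespace selector, the wildcard approach, and the use verb are taken from the text.

```shell
# Sketch: permissive CertificateRequestPolicy plus the RBAC that lets
# cert-manager use it. Names and usages are hypothetical placeholders.
render_dpf_cert_policy() {
  cat <<'EOF'
apiVersion: policy.cert-manager.io/v1alpha1
kind: CertificateRequestPolicy
metadata:
  name: dpf-operator-permissive
spec:
  allowed:
    commonName: { value: "*" }
    dnsNames: { values: ["*"] }
    ipAddresses: { values: ["*"] }
    uris: { values: ["*"] }
    emailAddresses: { values: ["*"] }
    isCA: true
    usages:            # assumed list; tighten for production
      - server auth
      - client auth
      - digital signature
      - key encipherment
      - cert sign
  selector:
    issuerRef: {}      # any issuer
    namespace:
      matchNames: ["dpf-operator-system"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dpf-operator-permissive-use
rules:
  - apiGroups: ["policy.cert-manager.io"]
    resources: ["certificaterequestpolicies"]
    verbs: ["use"]
    resourceNames: ["dpf-operator-permissive"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dpf-operator-permissive-use
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dpf-operator-permissive-use
subjects:
  - kind: ServiceAccount
    name: cert-manager        # assumed default install
    namespace: cert-manager   # assumed default install
EOF
}

render_dpf_cert_policy
# To apply: render_dpf_cert_policy | kubectl apply -f -
```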
2. DPF Installation
Follow the upstream DPF installation guide for the actual install procedure:
When installing the DPF operator chart, two parameter overrides are required for a NICo-integrated deployment. The example command below illustrates how to set them:
NICo-specific notes on the parameters:
- enableNodeFeatureRules=false: the chart’s bundled NodeFeatureRule resources are disabled because nodes are labeled via NFD’s own configuration (relying on PCI class 0200).
- imagePullSecrets[0].name=dpf-pull-secret: ties the operator’s pods to the pull Secret created in step 1.2.b so that staging images can be pulled.
Adjust REGISTRY and TAG to the version of DPF you are deploying.
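A sketch of how the two overrides could be passed to Helm is shown below. The release name dpf-operator, the way REGISTRY is used as the chart reference, and the placeholder values are assumptions; take the real chart location and version from the upstream guide. The run helper only prints the command unless RUN=1 is set.

```shell
# Dry-run wrapper: prints the command by default, executes it when RUN=1.
run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

REGISTRY="${REGISTRY:-oci://REPLACE_ME/dpf-operator}"   # placeholder chart reference
TAG="${TAG:-REPLACE_ME}"                                # placeholder chart version

run helm upgrade --install dpf-operator "$REGISTRY" \
  --namespace dpf-operator-system \
  --version "$TAG" \
  --set enableNodeFeatureRules=false \
  --set 'imagePullSecrets[0].name=dpf-pull-secret'
```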
3. Post-Installation Configuration (before NICo starts)
Once the DPF operator is running, the following objects must be applied before NICo is started. They configure the DPF operator for NICo’s provisioning model and grant the orchestrator the access it needs.
3.1. Cluster-wide RBAC for the NICo orchestrator
The NICo orchestrator (the carbide-api ServiceAccount in NICo’s default
deployment) needs to read and write across namespaces, including
dpf-operator-system and the per-DPU namespaces. Grant it cluster-admin via
a ClusterRoleBinding:
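A minimal sketch of such a binding follows. The binding name and the NICO_NAMESPACE variable (the namespace where the carbide-api ServiceAccount lives) are placeholders that are not specified in this guide.

```shell
# Sketch: bind the carbide-api ServiceAccount to the built-in cluster-admin
# ClusterRole. The binding name and namespace are hypothetical placeholders.
render_carbide_api_crb() {
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: carbide-api-cluster-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: carbide-api
    namespace: ${NICO_NAMESPACE:-REPLACE_ME}
EOF
}

render_carbide_api_crb
# To apply: NICO_NAMESPACE=<namespace> render_carbide_api_crb | kubectl apply -f -
```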
3.2. DPFOperatorConfig
This is the operator-level CR that tells DPF how to behave in a NICo environment. For more information about the available fields and their details, refer to the official DPF guide.
Field-by-field:
3.3. DPUCluster
The DPUCluster CR defines the Kubernetes control plane that DPU nodes will join. The interface and vip fields must be customized for the environment. For more information about the available fields and their details, refer to the official DPF guide.
Field-by-field:
3.4. VIP LoadBalancer Service and Endpoints
This step exposes the Kamaji cluster IP so it is routable from the DPUs. It may not be required in environments where routing to the VIP is already in place; in that case skip it.
The Service uses a fixed loadBalancerIP matching the VIP set in the DPUCluster above. Replace the loadBalancerIP value before applying.
Note: This step applies only to MetalLB-managed deployments.
What this does and why it looks unusual:
- The Service is type LoadBalancer with a fixed loadBalancerIP (the same VIP used by the DPUCluster keepalived). Update the metallb.io/address-pool: REPLACE_ME annotation with the correct pool name; it tells MetalLB to allocate the IP from that pool, which is defined elsewhere.
- A manually created Endpoints object with a single dummy RFC 5737 IP (192.0.2.10) is created with the same name as the Service. This is a Kubernetes idiom: when an Endpoints resource has the same name as a Service that has no selector, Kubernetes does not manage those Endpoints and kube-proxy uses them verbatim. Putting a dummy IP here means: “reserve the VIP via MetalLB, but route nothing; keepalived is the actual front-end.”
- Net effect: MetalLB advertises the VIP to the network so external machines (DPUs, BMCs) can reach it, while keepalived handles the actual TCP termination.
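The Service and Endpoints pair described above could be sketched as follows. The object name dpu-cluster-vip, the namespace, the port 6443, and the METALLB_POOL variable are assumptions for illustration; the LoadBalancer type, the loadBalancerIP field, the MetalLB annotation, the missing selector, and the dummy 192.0.2.10 address follow the description.

```shell
# Sketch: selector-less LoadBalancer Service plus a same-named Endpoints
# object. Usage: render_vip_service <VIP>
render_vip_service() {
  vip="${1:?usage: render_vip_service <VIP>}"
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: dpu-cluster-vip            # hypothetical name
  namespace: dpf-operator-system   # assumed namespace
  annotations:
    metallb.io/address-pool: ${METALLB_POOL:-REPLACE_ME}
spec:
  type: LoadBalancer
  loadBalancerIP: ${vip}
  ports:
    - name: https
      port: 6443                   # assumed port
      targetPort: 6443
      protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: dpu-cluster-vip            # must match the Service name
  namespace: dpf-operator-system
subsets:
  - addresses:
      - ip: 192.0.2.10             # dummy RFC 5737 address
    ports:
      - name: https
        port: 6443
        protocol: TCP
EOF
}

# Example with a documentation-range address; use the DPUCluster VIP instead.
render_vip_service 198.51.100.20
# To apply: render_vip_service <VIP> | kubectl apply -f -
```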
If your environment uses a different LoadBalancer mechanism (kube-vip, a cloud-provider LB, etc.), use it to expose the VIP and point the DPUCluster’s keepalived.vip at the same address.
3.5. Enable DPF in the NICo site config
DPF integration is gated on a site-level switch in the carbide-api TOML config
(the file mounted into the carbide-api deployment, typically via the
carbide-api-site-config-files ConfigMap). Add a [dpf] section and set
enabled = true:
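A minimal sketch of the TOML fragment, wrapped in a shell helper so it can be printed or redirected into the site config; the helper name is illustrative, and the commented-out key is the optional parameter described in this section.

```shell
# Print the minimal [dpf] section for the carbide-api site config.
render_dpf_toml() {
  cat <<'EOF'
[dpf]
enabled = true
# Optional; when omitted, NICo falls back to dpf-pull-secret:
# docker_image_pull_secret = "REPLACE_ME"
EOF
}

render_dpf_toml
```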
docker_image_pull_secret is an optional parameter naming the Kubernetes Secret used to pull the NICo service container images. If it is omitted, NICo defaults to dpf-pull-secret for image pulls; in that case, ensure that dpf-pull-secret is configured with a legacy NGC API key for better compatibility.
[dpf].services.* sub-tables can additionally override the Helm chart and
container image of each mandatory DPUService that carbide-api deploys
(dts, doca_hbn, dpu_agent, dhcp_server, fmds, otel). All of these
have built-in defaults; override them only when pinning to a non-default
version or registry. Each entry has the same shape:
Field reference (all under [dpf]):
Notes:
- The DPF operator namespace (dpf-operator-system) and the kubeconfig used to talk to the host cluster are not configured here; carbide-api uses its in-cluster ServiceAccount and the fixed dpf-operator-system namespace.
3.6. Mark hosts as DPF-managed in expected machines
Whether a given host is provisioned via DPF or via iPXE is decided per host,
in the expected machines list that NICo loads on startup. The relevant
field is is_dpf_enabled on each expected-machine entry. A host is
provisioned via DPF only when both of the following are true:
- [dpf].enabled = true in the site config (section 3.5), and
- is_dpf_enabled = true on that host’s expected-machine entry.
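The two-condition rule can be illustrated with a small truth table. This is only a sketch of the decision as described here, not NICo's actual code; the function name uses_dpf is hypothetical.

```shell
# uses_dpf <site_enabled> <host_enabled>
# Succeeds only when both the site switch and the per-host flag are "true".
uses_dpf() { [ "$1" = "true" ] && [ "$2" = "true" ]; }

for pair in "true true" "true false" "false true" "false false"; do
  set -- $pair   # split into $1 (site switch) and $2 (host flag)
  if uses_dpf "$1" "$2"; then mode="DPF"; else mode="iPXE"; fi
  printf '[dpf].enabled=%-5s is_dpf_enabled=%-5s -> %s\n' "$1" "$2" "$mode"
done
```

Only the first combination selects DPF; in every other case the host falls back to iPXE-based provisioning.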
There are several operator paths that can set this field. They are described below in the order an operator typically uses them.
3.6.a. carbide-admin-cli expected-machine add — create a new entry
Adds a new expected-machine row. --dpf-enabled is optional; omitting it
stores false.
3.6.b. carbide-admin-cli expected-machine patch — partial update via flags
Updates an existing entry in place. The lookup key is --bmc-mac-address
(or --id <UUID>). Omitting --dpf-enabled preserves the existing
value.
3.6.c. carbide-admin-cli expected-machine update --filename — single-host update from JSON
Updates one entry from a JSON file. The JSON shape uses
chassis_serial_number (not serial_number) and any field omitted from the
file is preserved server-side.
em.json:
This is the most ergonomic path for “toggle DPF on one already-existing expected machine without touching anything else.”
3.6.d. carbide-admin-cli expected-machine replace-all --filename — destructive full reload
Wipes the entire expected_machines table and re-creates it from the file.
The file shape is a wrapper object whose expected_machines array uses the
same per-entry shape as update:
em-all.json:
Important: this is not a merge. Any expected-machine row that is not present in the file is deleted. Each entry is then re-created via the same path as add, so any entry whose dpf_enabled is omitted is re-inserted with dpf_enabled = false.
3.6.e. Quick reference
3.7. Enabling DPF for Existing (Ingested) Nodes
You can enable the DPF flag on an already discovered host without force-deleting or recreating it by using:
After changing the DPF status for a host in this way, trigger reprovisioning for every DPU under that host (using its host ID). If the host has multiple DPUs, all of them must be reprovisioned; otherwise, NICo will not transition the node to DPF-managed status.
Note: The carbide-admin-cli dpf enable command updates the DPF flag only for the currently ingested machine. If you later force-delete the host, this change is lost—on rediscovery, the DPF setting will revert to whatever is present in your expected_machines database.
4. Restart carbide-api to create the DPF initialization objects
Once everything in sections 1–3 is in place, carbide-api must be (re)started.
DPF initialization in carbide-api is startup-only: the [dpf] config is
read once when the process comes up, and that is the only point at which the
DPF initialization objects are created in the host cluster.
On startup with [dpf].enabled = true, carbide-api creates the following
objects in the dpf-operator-system namespace:
- a Secret (bmc-shared-password) holding the shared BMC password,
- a BFB CR named bf-bundle-<sha256([dpf].bfb_url)>,
- a DPUFlavor CR named after [dpf].flavor_name,
- a set of DPUServiceInterface, DPUServiceTemplate, DPUServiceConfiguration, and DPUServiceNAD CRs, one per mandatory DPUService (dts, doca-hbn, carbide-dpu-agent, carbide-dhcp-server, carbide-fmds, carbide-otelcol),
- a DPUDeployment CR named after [dpf].deployment_name, which references the BFB, the DPUFlavor, and the service templates above, and which the DPF operator then reconciles into actual DPUService and per-DPU resources, and
- spec.provisioningController.bfCFGTemplateConfigMap).
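To locate the BFB CR for a given URL, its name can be derived from the naming scheme above. The sketch below assumes that the suffix is the full lowercase hex SHA-256 digest of the URL string exactly as written in [dpf].bfb_url, with no trailing newline; that encoding detail is an assumption and should be checked against the objects actually present in the cluster.

```shell
# Sketch: derive the expected BFB CR name from a BFB URL.
# Assumes a full hex SHA-256 of the exact URL string (not confirmed here).
bfb_cr_name() {
  digest=$(printf '%s' "$1" | sha256sum | cut -d' ' -f1)
  printf 'bf-bundle-%s\n' "$digest"
}

# Placeholder URL; use the value of [dpf].bfb_url from the site config.
bfb_cr_name "https://example.com/path/to/bundle.bfb"
```

The resulting name could then be compared with the BFB objects listed in the dpf-operator-system namespace.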
Because this path runs only at process start, any change to [dpf] —
enabling DPF for the first time, changing the BFB URL, renaming the
DPUDeployment/DPUFlavor, or pinning a different chart/image version under
[dpf.services.*] — requires a carbide-api restart for the new
configuration to take effect.
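One common way to restart a Deployment is kubectl rollout restart. The sketch below assumes that carbide-api runs as a Deployment named carbide-api and that NICO_NAMESPACE holds its namespace; both are assumptions to adapt to the actual deployment. The run helper only prints the commands unless RUN=1 is set.

```shell
# Dry-run wrapper: prints the command by default, executes it when RUN=1.
run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

NICO_NAMESPACE="${NICO_NAMESPACE:-REPLACE_ME}"   # placeholder namespace

# Restart carbide-api so that it re-reads [dpf], then wait for the rollout.
run kubectl rollout restart deployment/carbide-api --namespace "$NICO_NAMESPACE"
run kubectl rollout status deployment/carbide-api --namespace "$NICO_NAMESPACE"
```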
Appendix: carbide-admin-cli dpf command reference
carbide-admin-cli ships a top-level dpf subcommand group for inspecting and
toggling DPF state on already-ingested hosts and for diffing the running DPF
service stack against the configured one. The full set is listed below.
Important: All dpf enable changes are written to the machine’s metadata only. They are wiped on force-delete, and on rediscovery the host reverts to whatever its expected-machine entry says. To persist the per-host DPF setting, update the expected-machines table (see section 3.6). The dpf enable command is useful when you want to reprovision, through DPF, a host that was not previously managed by DPF.
dpf enable — turn DPF on for a host
Sets machines.dpf.enabled = true on the given host’s runtime row by calling
the ModifyDPFState RPC.
dpf show — inspect DPF state for one or all hosts
Output for a single host prints Enabled and Used For Ingestion flags; the
multi-host form prints a table with one row per host. DPUs are excluded
from the all-hosts list.
dpf snapshot — dump DPF CRs for a host
Calls the GetDpfHostSnapshot RPC and prints the DPUNode, DPUDevice, and
DPU CRs that DPF currently has for the given host. Useful for diagnosing
why a host is stuck during DPF-based provisioning.
dpf service-version (alias: sv) — diff configured vs. deployed services
No arguments. Prints a table comparing each configured DPF service
(taken from [dpf.services.*] in the site config if set, otherwise from
the compiled-in defaults) against what is actually deployed
in the cluster:
A DIFFERS row indicates the running stack does not match the carbide-api
config and that a carbide-api restart (section 4) is needed to reconcile the
configured versions onto the cluster.