DPUService
The DPUService CRD provides an API for DPF users to deploy applications on DPU nodes using Helm charts.
The following is an example of a DPUService YAML for the blueman service:
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUService
metadata:
  name: doca-blueman-service
  namespace: dpf-operator-system
spec:
  helmChart:
    source:
      repoURL: https://helm.ngc.nvidia.com/nvidia/doca
      version: 1.0.8
      chart: doca-blueman
  serviceDaemonSet:
    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 2
    labels:
      dpuservice.dpu.nvidia.com/name: doca-blueman-service
    annotations:
      dpuservice.dpu.nvidia.com/name: doca-blueman-service
In the above example:
- A DPUService named doca-blueman-service is created in the dpf-operator-system namespace.
- A Helm chart located at https://helm.ngc.nvidia.com/nvidia/doca, with version 1.0.8 and chart doca-blueman, is used to deploy the DPUService.
- A DaemonSet is deployed on target DPU nodes. We specify the update strategy, labels, and annotations for the DaemonSet.
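Assuming the manifest above is saved to a file (the file name below is only an example), it is applied like any other Kubernetes resource:
$ kubectl apply -f doca-blueman-dpuservice.yaml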
A DPUDeployment resource is the recommended way to deploy and manage DPUServices in DPF. The DPUDeployment resource is a higher-level abstraction that manages the lifecycle of DPUServices in the DPU cluster. It provides a way to deploy and upgrade DPUServices, and to validate their dependencies and resource requirements.
Helm Chart Configuration
The spec.helmChart field is used to specify the Helm chart that will be used to deploy the DPUService. The spec.helmChart.source field is used to specify the Helm chart repository URL, version, and chart name.
spec:
  helmChart:
    source:
      repoURL: https://helm.ngc.nvidia.com/nvidia/doca
      version: 1.0.8
      chart: doca-blueman
The following fields are used to specify the Helm chart:
- repoURL - the URL of the Helm chart repository. OCI and HTTP repositories are supported.
- version - the version of the Helm chart to be used.
- chart - the name of the Helm chart.
It is possible to specify the path to the Helm chart in the repository with the spec.helmChart.source.path field.
Optionally, the spec.helmChart.source.releaseName field can be used to specify the release name for the Helm release.
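As a sketch, the optional releaseName field can be set alongside the other source fields (the release name used here is hypothetical):
spec:
  helmChart:
    source:
      repoURL: https://helm.ngc.nvidia.com/nvidia/doca
      version: 1.0.8
      chart: doca-blueman
      releaseName: my-blueman-release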
Service DaemonSet Values
The spec.serviceDaemonSet field is used to specify the configuration of the DaemonSet. It represents Helm values to be used during the release of the Helm chart. These values take precedence over the values defined in the Helm chart.
spec:
  serviceDaemonSet:
    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 2
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "bfb"
          operator: In
          values: ["dk"]
    labels:
      "some-label": "some-value"
    annotations:
      "some-annotation": "some-value"
    resources:
      memory: 6Gi
      nvidia.com/bf_sf: 3
The following fields are used to specify the DaemonSet configuration:
- updateStrategy - the update strategy for the DaemonSet. This should be a valid Kubernetes DaemonSetUpdateStrategy object.
- nodeSelector - the node selector for the DaemonSet. This field is used to select the nodes where the DaemonSet's Pods will be deployed.
- labels - the labels to be applied to the DaemonSet's Pods.
- annotations - the annotations to be applied to the DaemonSet's Pods.
- resources - the resource requests and limits for the DaemonSet's Pods.
For more information on how to write a Helm chart for a DPUService, see the DPUService Development Guide.
Deploy a DPUService in the Host Cluster
It is possible to deploy a DPUService in the host cluster, i.e., the cluster where the DPF operator is running. This can be done by setting the spec.deployInCluster field in the DPUService manifest to true.
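For example:
spec:
  deployInCluster: true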
DPUServiceInterface Dependencies
A DPUService can have dependencies on DPUServiceInterface resources. These resources represent interfaces that are required by the DPUService to function properly.
The DPUServiceInterface resources must exist for the DPUService to be deployed successfully. If the DPUServiceInterface resources are not found in the cluster, the DPUService will not be deployed, and a condition will be set to false with a reason in the status field of the DPUService resource.
To declare a dependency on DPUServiceInterface resources, set the spec.interfaces field of the DPUService resource to the names of the DPUServiceInterfaces.
spec:
  interfaces:
  - p0-sif
  - p1-if
  - app-if
ConfigPorts
A DPUService workload can expose multiple ports to the host cluster. The configPorts field is used to specify the ports that should be exposed from the DPU nodes to the host cluster.
spec:
  configPorts:
    serviceType: NodePort
    ports:
    - name: port1
      protocol: TCP
      nodePort: 30001
    - name: port2
      protocol: TCP
      nodePort: 30000
The spec.configPorts.serviceType field defines the type of service to be created for the DPUService. It can be NodePort, ClusterIP, or None, with NodePort being the default value.
The spec.configPorts.ports field is used to specify the ports that should be exposed from the DPU nodes to the host cluster. Each port should have the following fields:
- name - the name of the port. This field is a unique identifier and must reference a port defined in the Helm chart.
- port - the port that will be exposed on the host cluster for use by in-cluster workloads (illustrated in the sketch after this list).
- protocol - the protocol to be used for the port. It can be TCP or UDP.
- nodePort - the external port to be exposed on the host cluster. This field is optional and is used only when the spec.configPorts.serviceType field is set to NodePort.
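The following is a sketch of a configPorts section that exposes the workload through a ClusterIP service using the port field; the port name and number are hypothetical and must match a port defined in the Helm chart:
spec:
  configPorts:
    serviceType: ClusterIP
    ports:
    - name: api
      port: 8080
      protocol: TCP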
Upon reconciliation, the DPUService controller will retrieve the list of services that exist in the DPU cluster for this DPUService. It will retrieve the ports that match the ports defined in the spec.configPorts.ports field, using the name field to match them. If the ports are found, the nodePort values will be collected. The controller will then make the collected nodePorts accessible from the host cluster by creating a new Service resource.
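A quick way to sanity-check the result is to list Services in the host cluster; assuming the controller creates the Service in the DPUService's namespace, for example:
$ kubectl get services -n dpf-operator-system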
For more information on how to write a Helm chart for a DPUService that exposes ports, see the DPUService development guide.
Updating a DPUService
The DPUService resource can be updated by changing the fields in the spec section of the resource. The controller will detect the changes and update the DPUService accordingly. The DPUService will not be recreated during the update process.
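One way to make such a change is to edit the resource in place:
$ kubectl edit dpuservice <dpuservice-name> -n <namespace>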
Waiting for Ready
When a DPUService is created, it may take some time for the workload to be ready.
It is possible to wait for a DPUService to be ready by using the kubectl wait command:
$ kubectl wait --for=condition=Ready dpuservice/<dpuservice-name> -n <namespace>
Suspending and Resuming
Suspending a DPUService
The need to temporarily suspend a DPUService may arise when performing maintenance or troubleshooting. The DPUService can be suspended by setting the spec.paused field to true.
spec:
  paused: true
Or by using the kubectl patch command:
$ kubectl patch dpuservice <dpuservice-name> \
    -n <namespace> \
    --field-manager=dpf-client-side-apply \
    --type='json' \
    -p='[{"op": "replace", "path": "/spec/paused", "value": true}]'
Note: When a DPUService is suspended, the controller will not reconcile the resource until the spec.paused field is set to false. This means that the DPUService can neither be updated nor deleted while it is suspended.
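To check whether a DPUService is currently suspended, one option is to read the field back directly:
$ kubectl get dpuservice <dpuservice-name> -n <namespace> -o jsonpath='{.spec.paused}'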
Resuming a DPUService
To resume a suspended DPUService, set the spec.paused field to false.
spec:
  paused: false
Or by using the kubectl patch command:
$ kubectl patch dpuservice <dpuservice-name> \
    -n <namespace> \
    --field-manager=dpf-client-side-apply \
    --type='json' \
    -p='[{"op": "replace", "path": "/spec/paused", "value": false}]'
Debugging DPUServices
There are several ways to debug DPUServices in DPF. The recommended way is to use the dpfctl command-line tool to gather information about the DPUServices in the DPU cluster.
$ dpfctl describe dpuservices --show-conditions=all
NAME NAMESPACE READY REASON SINCE MESSAGE
DPFOperatorConfig/dpfoperatorconfig dpf-operator-system True Success 2h
│ ├─ImagePullSecretsReconciled True Success 3h
│ ├─SystemComponentsReady True Success 2h
│ └─SystemComponentsReconciled True Success 2h
├─DPUServiceCredentialRequests
│ └─DPUServiceCredentialRequest/servicesetcontroller dpf-operator-system True Success 3h
│ ├─SecretReconciled True Success 3h
│ └─ServiceAccountReconciled True Success 3h
└─DPUServices
├─9 DPUServices... dpf-operator-system True Success 3h See flannel, multus, nvidia-k8s-ipam, ovs-cni, ovs-helper, servicechainset-controller,
│ servicechainset-rbac-and-crds, sfc-controller, sriov-device-plugin
├─DPUService/blueman-l2xsl dpf-operator-system False Pending 2h The following conditions are not ready: ApplicationsReady
│ ├─ApplicationPrereqsReconciled True Success 2h
│ ├─ApplicationsReady False Pending 2h Application is not ready (Sync: OutOfSync, Health: Healthy). Run 'kubectl describe application
│ │ dpu-cplane-tenant1-blueman-l2xsl -n dpf-operator-system' for details.
│ ├─ApplicationsReconciled True Success 2h
│ ├─ConfigPortsReconciled True Success 2h
│ └─DPUServiceInterfaceReconciled True Success 2h
└─DPUService/dts-rrjlp dpf-operator-system False Pending 2h The following conditions are not ready: ApplicationsReady
├─ApplicationPrereqsReconciled True Success 2h
├─ApplicationsReady False Pending 2h Application is not ready (Sync: OutOfSync, Health: Healthy). Run 'kubectl describe application
│ dpu-cplane-tenant1-dts-rrjlp -n dpf-operator-system' for details.
├─ApplicationsReconciled True Success 2h
├─ConfigPortsReconciled True Success 2h
└─DPUServiceInterfaceReconciled True Success 2h
For better control of maintenance and downtime, the cluster can be logically divided into several "zones". Each zone can have its own set of DPU services, which can be upgraded individually, affecting only that zone. Zones for DPU services are created by adding labels to the nodes in the DPU cluster and then using those labels in the DPU service YAML:
Create a DPUSet for worker nodes labeled "e2e.servers/dk=true" and, by adding the "cluster -> nodeLabels" section, assign their DPUs the label "bfb=dk" (on the DPU cluster):
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUSet
metadata:
  name: dpuset-dk
  namespace: dpf-operator-system
spec:
  dpuNodeSelector:
    matchLabels:
      e2e.servers/dk: "true"
  strategy:
    rollingUpdate:
      maxUnavailable: "10%"
    type: RollingUpdate
  dpuTemplate:
    spec:
      dpuFlavor: dpf-provisioning-hbn-ovn
      bfb:
        name: bf-bundle-dk-ga
      nodeEffect:
        taint:
          key: "dpu"
          value: "provisioning"
          effect: NoSchedule
      cluster:
        nodeLabels:
          bfb: "dk"
Then use the assigned label to create an HBN DPU Service for these specific nodes (under the "nodeSelector" section):
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUService
metadata:
  name: doca-hbn-dk
  namespace: dpf-operator-system
spec:
  serviceID: doca-hbn
  interfaces:
  - p0-sf-dk
  - p1-sf-dk
  - app-sf-dk
  serviceDaemonSet:
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "bfb"
          operator: In
          values: ["dk"]
    annotations:
      k8s.v1.cni.cncf.io/networks: |-
        [
          {"name": "iprequest", "interface": "ip_lo", "cni-args": {"poolNames": ["loopback"], "poolType": "cidrpool"}},
          {"name": "iprequest", "interface": "ip_pf2dpu3", "cni-args": {"poolNames": ["pool1"], "poolType": "cidrpool", "allocateDefaultGateway": true}}
        ]
  helmChart:
    source:
      repoURL: https://helm.ngc.nvidia.com/nvidia/doca
      version: 1.0.1
      chart: doca-hbn
    values:
      image:
        repository: nvcr.io/nvidia/doca/doca_hbn
        tag: 2.4.1-doca2.9.1
      resources:
        memory: 6Gi
        nvidia.com/bf_sf: 3
      configuration:
        perDPUValuesYAML: |
          - hostnamePattern: "*"
            values:
              bgp_autonomous_system: 65111
              bgp_peer_group: hbn
        startupYAMLJ2: |
          - header:
              model: BLUEFIELD
              nvue-api-version: nvue_v1
              rev-id: 1.0
              version: HBN 2.4.0
          - set:
              interface:
                lo:
                  ip:
                    address:
                      {{ ipaddresses.ip_lo.ip }}/32: {}
                  type: loopback
                p0_if,p1_if:
                  type: swp
                  link:
                    mtu: 9000
                pf2dpu3_if:
                  ip:
                    address:
                      {{ ipaddresses.ip_pf2dpu3.cidr }}: {}
                  type: swp
                  link:
                    mtu: 9000
              router:
                bgp:
                  autonomous-system: {{ config.bgp_autonomous_system }}
                  enable: on
                  graceful-restart:
                    mode: full
                  router-id: {{ ipaddresses.ip_lo.ip }}
              vrf:
                default:
                  router:
                    bgp:
                      address-family:
                        ipv4-unicast:
                          enable: on
                          redistribute:
                            connected:
                              enable: on
                        ipv6-unicast:
                          enable: on
                          redistribute:
                            connected:
                              enable: on
                      enable: on
                      neighbor:
                        p0_if:
                          peer-group: {{ config.bgp_peer_group }}
                          type: unnumbered
                        p1_if:
                          peer-group: {{ config.bgp_peer_group }}
                          type: unnumbered
                      path-selection:
                        multipath:
                          aspath-ignore: on
                      peer-group:
                        {{ config.bgp_peer_group }}:
                          remote-as: external
You can do the same for the additional required YAMLs (interfaces and chains):
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: app-sf-dk
  namespace: dpf-operator-system
spec:
  template:
    spec:
      nodeSelector:
        matchLabels:
          bfb: "dk"
      template:
        metadata:
          labels:
            svc.dpu.nvidia.com/interface: "app_sf"
            svc.dpu.nvidia.com/service: doca-hbn
        spec:
          interfaceType: service
          service:
            serviceID: doca-hbn
            network: mybrhbn
            interfaceName: pf2dpu3_if
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: p0-sf-dk
  namespace: dpf-operator-system
spec:
  template:
    spec:
      nodeSelector:
        matchLabels:
          bfb: "dk"
      template:
        metadata:
          labels:
            svc.dpu.nvidia.com/interface: "p0_sf"
            svc.dpu.nvidia.com/service: doca-hbn
        spec:
          interfaceType: service
          service:
            serviceID: doca-hbn
            network: mybrhbn
            interfaceName: p0_if
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: p1-sf-dk
  namespace: dpf-operator-system
spec:
  template:
    spec:
      nodeSelector:
        matchLabels:
          bfb: "dk"
      template:
        metadata:
          labels:
            svc.dpu.nvidia.com/interface: "p1_sf"
            svc.dpu.nvidia.com/service: doca-hbn
        spec:
          interfaceType: service
          service:
            serviceID: doca-hbn
            network: mybrhbn
            interfaceName: p1_if
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceChain
metadata:
  name: hbn-to-fabric-dk
  namespace: dpf-operator-system
spec:
  template:
    spec:
      nodeSelector:
        matchLabels:
          bfb: "dk"
      template:
        spec:
          switches:
          - ports:
            - serviceInterface:
                matchLabels:
                  uplink: p0
            - serviceInterface:
                matchLabels:
                  svc.dpu.nvidia.com/service: doca-hbn
                  svc.dpu.nvidia.com/interface: "p0_sf"
          - ports:
            - serviceInterface:
                matchLabels:
                  uplink: p1
            - serviceInterface:
                matchLabels:
                  svc.dpu.nvidia.com/service: doca-hbn
                  svc.dpu.nvidia.com/interface: "p1_sf"
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceChain
metadata:
  name: ovn-to-hbn-dk
  namespace: dpf-operator-system
spec:
  template:
    spec:
      nodeSelector:
        matchLabels:
          bfb: "dk"
      template:
        spec:
          switches:
          - ports:
            - serviceInterface:
                matchLabels:
                  svc.dpu.nvidia.com/service: doca-hbn
                  svc.dpu.nvidia.com/interface: "app_sf"
            - serviceInterface:
                matchLabels:
                  port: ovn
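After applying these manifests, the zone-scoped services can be checked, for example, with the dpfctl tool shown earlier or with a plain kubectl listing:
$ dpfctl describe dpuservices
$ kubectl get dpuservices -n dpf-operator-system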