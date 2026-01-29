On This Page
- Overview
- Key Features
- DPUNode Specification
- Conditions
- Example Usage
- Reboot Methods
- Integration with Kubernetes
- Lifecycle Management
- Monitoring and Troubleshooting
- Related Resources
DPUNode
The DPUNode is a Kubernetes CRD that represents a physical host node containing one or more DPU (Data Processing Unit) devices in the DOCA Platform Framework (DPF). It provides node-level management capabilities for DPU provisioning, reboot control, and integration with Kubernetes clusters.
Overview
The DPUNode resource serves as a bridge between physical host nodes and DPU devices, enabling centralized management of DPU provisioning and host operations. It defines how DPUs should be provisioned on a specific node and how the host should be managed during DPU operations.
Key Features
* Node-Level Management: Manages DPU operations at the host node level
* Reboot Control: Configurable host reboot methods (gNOI, external, script)
* DMS Integration: Integration with Device Management Service (DMS)
* DPU Association: Links multiple DPU devices to a single node
* Kubernetes Integration: Optional integration with Kubernetes Node objects
DPUNode Specification
DPUNodeSpec
The spec section defines the desired configuration for the DPU node:
| Field | Type | Required | Description |
|---|---|---|---|
| nodeRebootMethod | NodeRebootMethod | No | Method for rebooting the host (default: gNOI) |
| nodeDMSAddress | DMSAddress | No | IP and port for DMS communication |
| dpus | []DPURef | No | List of DPU devices attached to this node |
NodeRebootMethod
Defines how the host should be rebooted during DPU operations:
| Field | Type | Required | Description |
|---|---|---|---|
| gNOI | GNOI | No | Use DPU's DMS interface to reboot the host |
| external | External | No | Reboot via external means (not controlled by DPU controller) |
| script | Script | No | Reboot by executing a custom script |
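The selection rule described above can be sketched in Python. This is a minimal illustration, not the controller's own logic: the helper name effective_reboot_method is hypothetical, and the sketch assumes that at most one method key is set under nodeRebootMethod, which this page does not state explicitly. The gNOI fallback reflects the default given in the DPUNodeSpec description.

```python
# Hypothetical helper; the keys mirror the field names used in the manifests.
REBOOT_METHODS = ("gNOI", "external", "script")


def effective_reboot_method(spec):
    """Return the reboot method selected in a DPUNode spec dict."""
    configured = spec.get("nodeRebootMethod") or {}
    chosen = [m for m in REBOOT_METHODS if m in configured]
    if len(chosen) > 1:
        # Assumption: the three methods are mutually exclusive.
        raise ValueError(f"more than one reboot method set: {chosen}")
    # Documented default when no method is configured.
    return chosen[0] if chosen else "gNOI"
```

A key counts as selected even when its value is an empty mapping, which is how gNOI: {} and external: {} appear in the manifests.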
DMSAddress
Configuration for Device Management Service communication:
| Field | Type | Required | Description |
|---|---|---|---|
| ip | string | Yes | IP address in IPv4 format |
| port | uint16 | Yes | Port number (minimum: 1) |
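The DMSAddress constraints (an IPv4 address and a uint16 port of at least 1) can be checked before a manifest is applied. The sketch below uses Python's standard ipaddress module; the function name validate_dms_address is hypothetical, and the actual validation performed by the API server may differ in detail.

```python
import ipaddress


def validate_dms_address(ip, port):
    """Return a list of problems; an empty list means the address passes."""
    problems = []
    try:
        if not isinstance(ip, str):
            raise ValueError("ip must be a string")
        # IPv4Address rejects IPv6 literals, hostnames, and malformed strings.
        ipaddress.IPv4Address(ip)
    except ValueError:
        problems.append(f"ip {ip!r} is not an IPv4 address")
    # uint16 with a minimum of 1; bool is excluded because it subclasses int.
    if isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append(f"port {port!r} is outside 1..65535")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report an invalid IP and an invalid port together.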
DPURef
Reference to a DPU device:
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Name of the DPU device |
DPUNodeStatus
The status section contains the observed state of the DPU node:
| Field | Type | Description |
|---|---|---|
| conditions | array | Array of condition objects describing node state |
| dpuInstallInterface | string | Interface used for DPU installation (gNOI or redfish) |
| kubeNodeRef | string | Name of the Kubernetes Node object (immutable) |
| rebootInProgress | bool | Indicates if the node is currently rebooting |
Conditions
The DPUNode resource uses several condition types to track its state:
* Ready: The DPU node is ready for operations
* InvalidDPUDetails: The DPU details provided are invalid
* DPUNodeRebootInProgress: The DPUNode is in the process of rebooting
* DPUUpdateInProgress: The DPU is being updated
* NeedHostAgentUpgrade: The host agent needs to be upgraded
* OOBBridgeConfigured: The out-of-band bridge (br-dpu) is configured
* RshimAvailable: The rshim interface is available
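Reading these conditions programmatically amounts to a lookup by type in status.conditions. The following Python sketch assumes each condition object carries type and status fields with the string values "True", "False", or "Unknown", which matches the JSONPath query used under Status Monitoring. The helper names and the sample object are illustrative only.

```python
def condition_status(dpunode, condition_type):
    """Return the status string of one condition, or None if it is absent."""
    for cond in dpunode.get("status", {}).get("conditions", []):
        if cond.get("type") == condition_type:
            return cond.get("status")
    return None


def active_conditions(dpunode):
    """Return the types of all conditions whose status is "True"."""
    conditions = dpunode.get("status", {}).get("conditions", [])
    return [c.get("type") for c in conditions if c.get("status") == "True"]


# Illustrative object, not captured from a real cluster.
sample = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "False"},
            {"type": "DPUNodeRebootInProgress", "status": "True"},
            {"type": "RshimAvailable", "status": "True"},
        ]
    }
}
```

With this sample, condition_status(sample, "Ready") gives "False", and active_conditions(sample) lists the two conditions that are currently true.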
Example Usage
Basic DPUNode with gNOI Reboot
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUNode
metadata:
  name: dpu-node-001
  namespace: dpf-operator-system
spec:
  nodeRebootMethod:
    gNOI: {}
  nodeDMSAddress:
    ip: "192.168.1.100"
    port: 443
  dpus:
    - name: dpu-device-001
    - name: dpu-device-002
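When many nodes share the same layout, a manifest like the one above can be generated instead of written by hand. The Python sketch below builds the same structure as a dictionary and prints it as JSON, which kubectl accepts alongside YAML. The function build_dpunode and its parameters are hypothetical conveniences, not part of DPF.

```python
import json


def build_dpunode(name, dpu_names, dms_ip=None, dms_port=None,
                  namespace="dpf-operator-system"):
    """Build a DPUNode manifest with gNOI reboot as a plain dict."""
    spec = {
        "nodeRebootMethod": {"gNOI": {}},
        "dpus": [{"name": n} for n in dpu_names],
    }
    # nodeDMSAddress is optional; include it only when both parts are given.
    if dms_ip is not None and dms_port is not None:
        spec["nodeDMSAddress"] = {"ip": dms_ip, "port": dms_port}
    return {
        "apiVersion": "provisioning.dpu.nvidia.com/v1alpha1",
        "kind": "DPUNode",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


manifest = build_dpunode(
    "dpu-node-001",
    ["dpu-device-001", "dpu-device-002"],
    dms_ip="192.168.1.100",
    dms_port=443,
)
# The JSON output can be piped to `kubectl apply -f -`.
print(json.dumps(manifest, indent=2))
```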
DPUNode with External Reboot
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUNode
metadata:
  name: dpu-node-002
  namespace: dpf-operator-system
spec:
  nodeRebootMethod:
    external: {}
  dpus:
    - name: dpu-device-003
DPUNode with Custom Script Reboot
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUNode
metadata:
  name: dpu-node-003
  namespace: dpf-operator-system
spec:
  nodeRebootMethod:
    script:
      name: custom-reboot-script
  dpus:
    - name: dpu-device-004
Custom Reboot Script ConfigMap
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-reboot-script
  namespace: dpf-operator-system
data:
  pod-template: |
    apiVersion: v1
    kind: Pod
    metadata:
      name: custom-reboot-pod
      namespace: dpf-operator-system
    spec:
      containers:
        - name: reboot-container
          image: ubuntu:20.04
          command: ["/bin/bash"]
          args:
            - -c
            - |
              echo "Performing custom reboot procedure..."
              # Add your custom reboot logic here
              # For example: IPMI commands, SSH to BMC, etc.
              sleep 10
              exit 0
      restartPolicy: Never
Reboot Methods
gNOI (Default)
Uses the DPU's Device Management Service (DMS) interface to reboot the host. This is the recommended method for most deployments.
Advantages:
* Integrated with DPU management
* Reliable and consistent
* No external dependencies
Requirements:
* DMS must be accessible
* Valid DMS address configuration
External
Reboots the host via external means not controlled by the DPU controller. This method requires manual intervention or external automation.
Use Cases:
* Custom power management systems
* IPMI-based reboots
* Cloud provider APIs
Requirements:
* External reboot mechanism must be available
* Manual intervention may be required
Script
Executes a custom script to reboot the host. The script is defined in a ConfigMap and executed as a Kubernetes Job.
Use Cases:
* Custom reboot procedures
* Integration with existing automation
* Complex reboot workflows
Requirements:
* ConfigMap with pod template
* Script must exit successfully
* Proper RBAC permissions
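The first requirement can be checked ahead of time by confirming that the ConfigMap named in spec.nodeRebootMethod.script.name exists and carries a pod template. The Python sketch below works on already-loaded objects. The function name is hypothetical, the pod-template key is taken from the example ConfigMap on this page, and the sketch assumes the ConfigMap lives in the same namespace as the DPUNode, as in the examples.

```python
def script_reboot_problems(dpunode, configmaps):
    """Check the script reboot reference of a DPUNode.

    configmaps maps ConfigMap names to ConfigMap objects from the
    DPUNode's namespace (an assumption based on the examples).
    """
    method = dpunode.get("spec", {}).get("nodeRebootMethod") or {}
    script = method.get("script")
    if script is None:
        return []  # script reboot is not selected, nothing to check
    name = script.get("name")
    if not name:
        return ["spec.nodeRebootMethod.script.name is empty"]
    configmap = configmaps.get(name)
    if configmap is None:
        return [f"ConfigMap {name!r} not found"]
    if not (configmap.get("data") or {}).get("pod-template"):
        return [f"ConfigMap {name!r} has no pod-template entry"]
    return []
```

This only validates the reference. Whether the script exits successfully and whether RBAC permissions are sufficient can only be observed at run time.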
Integration with Kubernetes
Node Association
DPUNode can optionally be associated with a Kubernetes Node object:
status:
  kubeNodeRef: "worker-node-001"
This association enables:
* Node-level operations (draining, tainting)
* Integration with Kubernetes scheduling
* Resource management alignment
Annotations
DPUNode supports the following annotation for external reboot requirements:
metadata:
  annotations:
    provisioning.dpu.nvidia.com/dpunode-external-reboot-required: "true"
Lifecycle Management
Creation
DPUNode resources are typically created:
* Manually: By administrators for known nodes
* Automatically: Via discovery processes
* Via DPUSet: As part of bulk node management
Updates
Most fields in DPUNode can be updated, with the following notes:
* kubeNodeRef is immutable once set
* The dpus list can be modified to add or remove devices
Deletion
DPUNode resources are protected by a finalizer (provisioning.dpu.nvidia.com/dpunode-protection) to prevent deletion while DPUs are in use.
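In standard Kubernetes behavior, an object with a finalizer stays visible after a delete request, with metadata.deletionTimestamp set, until the finalizer is removed. The Python sketch below detects that state for a DPUNode object loaded as a dictionary. The function name deletion_blocked is hypothetical, and the sketch does not cover when the controller removes the finalizer, which this page does not describe.

```python
PROTECTION_FINALIZER = "provisioning.dpu.nvidia.com/dpunode-protection"


def deletion_blocked(dpunode):
    """True when deletion was requested but the protection finalizer remains."""
    metadata = dpunode.get("metadata", {})
    deleting = metadata.get("deletionTimestamp") is not None
    return deleting and PROTECTION_FINALIZER in metadata.get("finalizers", [])
```

A node for which this returns True is waiting for its DPUs to be released, so it is usually more informative to inspect the referenced DPUs than to retry the delete.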
Monitoring and Troubleshooting
Checking Node Status
# Get all DPUNode resources
kubectl get dpunodes -n dpf-operator-system
# Get detailed information about a specific node
kubectl describe dpunode dpu-node-001 -n dpf-operator-system
# Check node conditions
kubectl get dpunode dpu-node-001 -n dpf-operator-system -o jsonpath='{.status.conditions}'
Common Issues
Invalid DMS Address: Verify IP and port configuration
DPU Not Found: Ensure referenced DPUDevice resources exist
Reboot Failures: Check reboot method configuration and permissions
Script Execution Errors: Verify ConfigMap and script syntax
Status Monitoring
# Check if node is ready
kubectl get dpunode dpu-node-001 -n dpf-operator-system -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# Check reboot status
kubectl get dpunode dpu-node-001 -n dpf-operator-system -o jsonpath='{.status.rebootInProgress}'
# Check install interface
kubectl get dpunode dpu-node-001 -n dpf-operator-system -o jsonpath='{.status.dpuInstallInterface}'
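For more than a handful of nodes, the three queries above can be combined into one pass over the output of kubectl get dpunodes -n dpf-operator-system -o json, which returns a list object with an items array. The Python sketch below parses that JSON text and extracts the same fields, plus the external reboot annotation described under Annotations. The function name summarize_dpunodes and the shape of the returned rows are illustrative choices.

```python
import json

EXTERNAL_REBOOT_ANNOTATION = (
    "provisioning.dpu.nvidia.com/dpunode-external-reboot-required"
)


def summarize_dpunodes(kubectl_json):
    """Summarize DPUNode status from `kubectl get dpunodes -o json` output."""
    rows = []
    for item in json.loads(kubectl_json).get("items", []):
        metadata = item.get("metadata", {})
        status = item.get("status", {})
        # Status of the Ready condition, or "Unknown" if it is not reported.
        ready = next(
            (c.get("status") for c in status.get("conditions", [])
             if c.get("type") == "Ready"),
            "Unknown",
        )
        annotations = metadata.get("annotations") or {}
        rows.append({
            "name": metadata.get("name", ""),
            "ready": ready,
            "rebootInProgress": bool(status.get("rebootInProgress", False)),
            "dpuInstallInterface": status.get("dpuInstallInterface", ""),
            "externalRebootRequired":
                annotations.get(EXTERNAL_REBOOT_ANNOTATION) == "true",
        })
    return rows
```

The function takes JSON text rather than invoking kubectl itself, so it can be fed from a pipe, a saved file, or a test fixture.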
Related Resources
* DPUDevice - Individual DPU device management
* DPU - DPU provisioning and deployment
* DPUSet - Bulk DPU and node management
* DPUDiscovery - Automatic DPU discovery