Adding Nodes to DPS
Overview
This guide provides step-by-step instructions for adding new compute nodes to an existing DPS (Domain Power Service) configuration. The process involves creating BMC credentials, updating the topology configuration, and ensuring the new nodes are properly integrated into the power management system.
Prerequisites
- DPS server running and accessible
- dpsctl installed and authenticated
- Existing topology already configured and active
- Access to the new node’s BMC (Baseboard Management Controller)
- BMC credentials for the new node
- Device specifications for the new node type (if not already defined)
Step 1: Verify Current Configuration
Before adding new nodes, verify your current DPS configuration:
# Check current topology status
dpsctl topology list
# List current entities
dpsctl topology list-entities
# Check active topology
dpsctl topology list --active
Example Output:
{
"topologies": [
{
"topology_name": "datacenter",
"is_active": true,
"leaf_node_names": ["node001", "node002"]
}
]
}
Step 2: Create BMC Credentials Secret
DPS requires BMC credentials to communicate with each node. Create a Kubernetes secret for the new node’s BMC credentials.
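When several nodes are added at once, the Secret manifest can be generated rather than typed by hand. The following is a minimal sketch, not part of DPS: the function name bmc_secret_manifest is hypothetical, and it assumes the secret layout shown in this step (a bmc key holding a JSON string with username and password, in the dps namespace).

```python
import json


def bmc_secret_manifest(node_name, username, password, namespace="dps"):
    """Return a Secret manifest (as a dict) shaped like the YAML in this step."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": node_name,
            "namespace": namespace,
            "labels": {"app": "bmc-secret"},
        },
        "type": "Opaque",
        # stringData values must be strings, so the credentials are
        # serialized to a JSON string under the "bmc" key.
        "stringData": {
            "bmc": json.dumps({"username": username, "password": password})
        },
    }


if __name__ == "__main__":
    manifest = bmc_secret_manifest("node003", "admin", "your-secure-password")
    # kubectl accepts JSON manifests as well as YAML ones.
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f, the same way as a YAML manifest.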
Option A: Using kubectl in-line
# Create BMC credentials secret for the new node
kubectl create secret generic node003 \
--namespace dps \
--from-literal=bmc='{"username":"admin","password":"your-secure-password"}' \
--dry-run=client -o yaml | kubectl apply -f -
Option B: Using YAML Manifest
Create a file named node003-secret.yaml:
apiVersion: v1
kind: Secret
metadata:
  name: node003
  namespace: dps
  labels:
    app: bmc-secret
type: Opaque
stringData:
  bmc: |
    {
      "username": "admin",
      "password": "your-secure-password"
    }
Apply the secret:
kubectl apply -f node003-secret.yaml
Step 3: Verify Device Specifications
Ensure the device specification for your new node type exists in DPS:
# List available device specifications
dpsctl device list
If the device specification doesn’t exist, import it:
# Import device specifications (if needed)
dpsctl device upsert devices.yaml
Example device specification for DGX GB200:
- type: ComputerSystem
description: NVIDIA GB200 Compute Tray (Bianca)
model: DGX_GB200
spec:
devices:
- type: CPU
model: Grace
count: 2
- type: GPU
model: GB200
count: 4
minLoadWatts: 900
maxLoadWatts: 5640
processorModulesCount: 2
Step 4: Create Updated Topology Configuration
Create a new topology file that includes the existing nodes plus the new node(s).
Option A: Export Current Topology and Modify
# Export current topology
dpsctl topology export --topology datacenter --filename current-topology.json
Edit the exported file to add the new node, and save the result as updated-topology.json, the filename used in the steps that follow:
{
"Entities": [
{
"Type": "PowerDomain",
"Name": "PD-A",
"Constraints": {
"PowerValue": {"Value": 1150000, "Type": "W"},
"PowerFactor": 0.9
}
},
{
"Type": "ComputerSystem",
"Model": "DGX_GB200",
"Name": "node001",
"Policy": "Node-High",
"Redfish": {
"@odata.type": "#ComputerSystem.v1_23_0.ComputerSystem",
"@odata.id": "/node001",
"Id": "node001",
"URL": "https://node001-bmc.example.com",
"SecretName": "node001"
}
},
{
"Type": "ComputerSystem",
"Model": "DGX_GB200",
"Name": "node002",
"Policy": "Node-High",
"Redfish": {
"@odata.type": "#ComputerSystem.v1_23_0.ComputerSystem",
"@odata.id": "/node002",
"Id": "node002",
"URL": "https://node002-bmc.example.com",
"SecretName": "node002"
}
},
{
"Type": "ComputerSystem",
"Model": "DGX_GB200",
"Name": "node003",
"Policy": "Node-High",
"Redfish": {
"@odata.type": "#ComputerSystem.v1_23_0.ComputerSystem",
"@odata.id": "/node003",
"Id": "node003",
"URL": "https://node003-bmc.example.com",
"SecretName": "node003"
}
}
],
"Topology": {
"Name": "datacenter",
"Entities": [
{
"Name": "PD-A",
"Children": ["node001", "node002", "node003"]
},
{
"Name": "node001"
},
{
"Name": "node002"
},
{
"Name": "node003"
}
]
}
}
Option B: Create New Topology File
Create a new file updated-topology.json with all nodes including the new one.
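Either option comes down to the same three changes: a new ComputerSystem entity, a new entry in the topology's entity list, and the node's name appended to its parent's Children. A minimal sketch of that edit is shown here, assuming the JSON layout of the export in Option A. The helper add_node is hypothetical and not part of dpsctl, and its defaults (DGX_GB200, Node-High) simply mirror the example nodes.

```python
import copy


def add_node(topology, name, parent, bmc_url,
             model="DGX_GB200", policy="Node-High"):
    """Return a copy of an exported topology with one more ComputerSystem."""
    updated = copy.deepcopy(topology)
    if any(e["Name"] == name for e in updated["Entities"]):
        raise ValueError(f"entity {name!r} already exists")

    # Look up the parent before changing anything.
    parents = [e for e in updated["Topology"]["Entities"] if e["Name"] == parent]
    if not parents:
        raise ValueError(f"parent {parent!r} not found in topology")

    # 1. Define the new entity, mirroring the existing node entries.
    updated["Entities"].append({
        "Type": "ComputerSystem",
        "Model": model,
        "Name": name,
        "Policy": policy,
        "Redfish": {
            "@odata.type": "#ComputerSystem.v1_23_0.ComputerSystem",
            "@odata.id": f"/{name}",
            "Id": name,
            "URL": bmc_url,
            "SecretName": name,
        },
    })
    # 2. Attach it to the parent power domain.
    parents[0].setdefault("Children", []).append(name)
    # 3. List it in the topology itself.
    updated["Topology"]["Entities"].append({"Name": name})
    return updated
```

A typical use would be to load current-topology.json with json.load, call add_node(topology, "node003", "PD-A", "https://node003-bmc.example.com"), and write the result to updated-topology.json with json.dump. The input dict is left unmodified, so the original export remains available for comparison.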
Step 5: Validate the Updated Topology
Validate the topology file before importing:
# Validate the updated topology
dpsctl topology validate --filename updated-topology.json
Expected Output:
{
"status": {
"ok": true,
"diag_msg": "Topology validation passed"
}
}
If validation fails, fix the errors before proceeding.
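Some common editing mistakes can be caught locally before running the validator. The sketch below is a hypothetical pre-check, not a substitute for dpsctl topology validate: it only looks for duplicate names, references to undefined entities, and nodes without a SecretName, and it assumes the JSON layout shown in Step 4.

```python
def precheck(topology):
    """Return a list of human-readable problems found in a topology dict."""
    problems = []
    entities = topology.get("Entities", [])
    names = [e.get("Name", "") for e in entities]
    defined = set(names)

    # Every entity name should be unique.
    for name in sorted(n for n in defined if names.count(n) > 1):
        problems.append(f"duplicate entity name: {name}")

    # Each node needs a secret so DPS can reach its BMC.
    for entity in entities:
        if entity.get("Type") == "ComputerSystem":
            if not entity.get("Redfish", {}).get("SecretName"):
                problems.append(
                    f"{entity.get('Name', '')}: missing Redfish.SecretName"
                )

    # The topology tree may only reference defined entities.
    for node in topology.get("Topology", {}).get("Entities", []):
        node_name = node.get("Name", "")
        if node_name not in defined:
            problems.append(f"topology references undefined entity: {node_name}")
        for child in node.get("Children", []):
            if child not in defined:
                problems.append(f"{node_name}: child {child} is not defined")
    return problems
```

An empty list means none of these particular problems were found; the file should still be passed through the validator.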
Step 6: Update the Topology
Update the existing topology with the new configuration:
# Update the topology with new nodes
dpsctl topology update --filename updated-topology.json
Expected Output:
{
"status": {
"ok": true,
"diag_msg": "Topology updated successfully"
}
}
Step 7: Verify the Update
Verify that the new node has been added to the topology:
# List all entities to confirm the new node is included
dpsctl topology list-entities
# Check topology details
dpsctl topology list
Expected Output:
{
"topologies": [
{
"topology_name": "datacenter",
"is_active": true,
"leaf_node_names": ["node001", "node002", "node003"]
}
]
}
Step 8: Test Node Connectivity
Test connectivity to the new node’s BMC:
# Test connectivity to the new node
dpsctl check connection --topology datacenter --nodes node003
Expected Output:
{
"total_nodes": 1,
"success_nodes": 1
}
You can also test connectivity to all nodes in the topology:
# Test connectivity to all nodes
dpsctl check connection --topology datacenter
Expected Output:
{
"total_nodes": 3,
"success_nodes": 3
}
Step 9: Reactivate the Topology (if needed)
If the topology was deactivated during the update, reactivate it:
# Reactivate the topology
dpsctl topology activate --topology datacenter --ping-hosts
Flags:
- --ping-hosts: Test connectivity of all hosts (recommended)
- --at-least-percent-hosts: Minimum percent of hosts that must be reachable (default: 50)
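The percentage threshold can be related to the connection check output from Step 8 with simple arithmetic. The following sketch is illustrative only: the function names are hypothetical, and treating the threshold as inclusive (reachable percent greater than or equal to the minimum) is an assumption rather than documented dpsctl behavior.

```python
def reachable_percent(result):
    """Percent of reachable nodes in a connection-check result dict."""
    total = result["total_nodes"]
    if total == 0:
        return 0.0
    return 100.0 * result["success_nodes"] / total


def meets_threshold(result, at_least_percent=50):
    """True if enough nodes answered (inclusive comparison, an assumption)."""
    return reachable_percent(result) >= at_least_percent


if __name__ == "__main__":
    # Shape taken from the Step 8 example output.
    result = {"total_nodes": 3, "success_nodes": 3}
    print(reachable_percent(result), meets_threshold(result))
```

With the default of 50, a three-node topology in which only one node responds (about 33 percent) would fall below the minimum under this reading.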
Step 10: Verify Power Management
Verify that the new node is properly integrated into power management:
# Check node status
dpsctl check status
# Verify power policies are applied
dpsctl policy list
Troubleshooting
Common Issues and Solutions
1. BMC Connection Failures
Symptoms:
- Node connection check fails
- Power policy application errors
Solutions:
- Verify BMC credentials are correct
- Check network connectivity to BMC
- Ensure BMC is accessible from DPS server
- Verify BMC Redfish API is enabled
# Test BMC connectivity manually
curl -k -u admin:password https://node003-bmc.example.com/redfish/v1
2. Device Specification Not Found
Symptoms:
- Validation errors about unknown device type/model
Solutions:
- Import the required device specifications
- Verify the device type and model match existing specifications
# Import device specifications
dpsctl device upsert devices.yaml
3. Secret Not Found
Symptoms:
- Authentication errors when connecting to BMC
Solutions:
- Verify the secret name matches the SecretName in the topology
- Check that the secret is in the correct namespace
- Ensure the secret contains the correct bmc key
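Once the value of the bmc key has been decoded, its shape can be checked without printing the password. This is a minimal sketch assuming the payload format from Step 2 (a JSON object with username and password); check_bmc_payload is a hypothetical helper, not part of DPS.

```python
import json


def check_bmc_payload(raw):
    """Return a list of problems with a decoded bmc secret value."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["payload is not a JSON object"]
    # Report missing fields by name only; never echo the credential values.
    return [
        f"missing or empty field: {key}"
        for key in ("username", "password")
        if not payload.get(key)
    ]
```

An empty list means the payload has the expected shape; it does not confirm that the credentials are accepted by the BMC.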
# Verify secret exists
kubectl get secret node003 -n dps
# Check secret contents (be careful with sensitive data)
kubectl get secret node003 -n dps -o jsonpath='{.data.bmc}' | base64 -d
4. Topology Update Conflicts
Symptoms:
- Update fails with hash mismatch errors
Solutions:
- Use --force flag to bypass hash validation (use with caution)
- Export the current topology and merge changes manually
- Ensure no other changes were made to the topology
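Before merging by hand or forcing an update, it can help to see exactly which entities differ between a fresh export and the file about to be applied. The sketch below compares the two by entity name; entity_diff is a hypothetical helper and assumes both files use the JSON layout shown in Step 4.

```python
def entity_diff(current, updated):
    """Compare two topology dicts; return (added, removed, changed) names."""
    cur = {e["Name"]: e for e in current.get("Entities", [])}
    new = {e["Name"]: e for e in updated.get("Entities", [])}
    added = sorted(new.keys() - cur.keys())
    removed = sorted(cur.keys() - new.keys())
    # Present in both files but with different definitions.
    changed = sorted(n for n in cur.keys() & new.keys() if cur[n] != new[n])
    return added, removed, changed
```

If the only difference reported is the node being added, the update file matches the live topology apart from the intended change. Anything listed under removed or changed deserves a closer look before --force is used.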
# Force update (use with caution)
dpsctl topology update --filename updated-topology.json --force
5. Power Policy Application Issues
Symptoms:
- Nodes not responding to power policy changes
- Power limit errors
Solutions:
- Verify the node supports the specified power policy plugin
- Check that the BMC supports the required Redfish endpoints
- Ensure the device specification includes the correct powerPolicyPlugin
Related Documentation
- Managing Topologies - Complete topology management guide
- Credentials and Secrets Configuration - Detailed secrets management
- Device Specifications - Understanding device capabilities
- Entities - Node configuration details
- Redfish API - BMC communication protocols