Adding Nodes to DPS

Overview

This guide provides step-by-step instructions for adding new compute nodes to an existing DPS (Domain Power Service) configuration. The process involves creating BMC credentials, updating the topology configuration, and ensuring the new nodes are properly integrated into the power management system.

Prerequisites

  • DPS server running and accessible
  • dpsctl installed and authenticated
  • Existing topology already configured and active
  • Access to the new node’s BMC (Baseboard Management Controller)
  • BMC credentials for the new node
  • Device specifications for the new node type (if not already defined)

Step 1: Verify Current Configuration

Before adding new nodes, verify your current DPS configuration:

# Check current topology status
dpsctl topology list

# List current entities
dpsctl topology list-entities

# Check active topology
dpsctl topology list --active

Example Output:

{
  "topologies": [
    {
      "topology_name": "datacenter",
      "is_active": true,
      "leaf_node_names": ["node001", "node002"]
    }
  ]
}
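Before editing anything, it can help to confirm that the new node name is not already a leaf of the active topology. The following is a minimal Python sketch, assuming the JSON shape of the example output above (`topologies`, `is_active`, `leaf_node_names`). The function name `node_in_active_topology` is hypothetical and not part of dpsctl.

```python
import json


def node_in_active_topology(list_output: str, node_name: str) -> bool:
    """Return True if node_name is a leaf node of any active topology.

    list_output is the JSON text printed by `dpsctl topology list`,
    assumed to have the shape shown in the example output.
    """
    data = json.loads(list_output)
    for topo in data.get("topologies", []):
        if topo.get("is_active") and node_name in topo.get("leaf_node_names", []):
            return True
    return False


# Sample copied from the example output in this step.
sample = """
{
  "topologies": [
    {
      "topology_name": "datacenter",
      "is_active": true,
      "leaf_node_names": ["node001", "node002"]
    }
  ]
}
"""

print(node_in_active_topology(sample, "node003"))  # False: not present yet
print(node_in_active_topology(sample, "node001"))  # True: already present
```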

Step 2: Create BMC Credentials Secret

DPS requires BMC credentials to communicate with each node. Create a Kubernetes secret for the new node’s BMC credentials.

Option A: Using kubectl Inline

# Create BMC credentials secret for the new node
kubectl create secret generic node003 \
  --namespace dps \
  --from-literal=bmc='{"username":"admin","password":"your-secure-password"}' \
  --dry-run=client -o yaml | kubectl apply -f -

Option B: Using YAML Manifest

Create a file named node003-secret.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: node003
  namespace: dps
  labels:
    app: bmc-secret
type: Opaque
stringData:
  bmc: |
    {
      "username": "admin",
      "password": "your-secure-password"
    }

Apply the secret:

kubectl apply -f node003-secret.yaml
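When many nodes are added at once, or a password contains quotes that are awkward to escape in a shell, the manifest can be generated instead of written by hand. The sketch below builds the same Secret as the YAML example, but serialized as JSON, which `kubectl apply -f -` also accepts. The helper name `bmc_secret_manifest` is hypothetical; the field layout is copied from the manifest above.

```python
import json


def bmc_secret_manifest(name: str, username: str, password: str,
                        namespace: str = "dps") -> str:
    """Build a Secret manifest (as JSON text) matching the YAML example."""
    manifest = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"app": "bmc-secret"},
        },
        "type": "Opaque",
        # stringData takes plain text; Kubernetes base64-encodes it on write.
        "stringData": {
            # json.dumps escapes quotes and backslashes in the password.
            "bmc": json.dumps({"username": username, "password": password}),
        },
    }
    return json.dumps(manifest, indent=2)


print(bmc_secret_manifest("node003", "admin", "your-secure-password"))
```

The printed JSON can be redirected to a file or piped to `kubectl apply -f -`. Avoid committing the generated output to version control, since it contains the password in clear text.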

Step 3: Verify Device Specifications

Ensure the device specification for your new node type exists in DPS:

# List available device specifications
dpsctl device list

If the device specification doesn’t exist, import it:

# Import device specifications (if needed)
dpsctl device upsert devices.yaml

Example device specification for DGX GB200:

- type: ComputerSystem
  description: NVIDIA GB200 Compute Tray (Bianca)
  model: DGX_GB200
  spec:
    devices:
      - type: CPU
        model: Grace
        count: 2
      - type: GPU
        model: GB200
        count: 4
    minLoadWatts: 900
    maxLoadWatts: 5640
    processorModulesCount: 2

Step 4: Create Updated Topology Configuration

Create a new topology file that includes the existing nodes plus the new node(s).

Option A: Export Current Topology and Modify

# Export current topology
dpsctl topology export --topology datacenter --filename current-topology.json

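Edit the exported file to add the new node, then save it as updated-topology.json (the file name used in the following steps). The new node must appear in three places: as an entry in the top-level "Entities" list, in the "Children" list of its parent power domain, and as its own entry under "Topology":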

{
  "Entities": [
    {
      "Type": "PowerDomain",
      "Name": "PD-A",
      "Constraints": {
        "PowerValue": {"Value": 1150000, "Type": "W"},
        "PowerFactor": 0.9
      }
    },
    {
      "Type": "ComputerSystem",
      "Model": "DGX_GB200",
      "Name": "node001",
      "Policy": "Node-High",
      "Redfish": {
        "@odata.type": "#ComputerSystem.v1_23_0.ComputerSystem",
        "@odata.id": "/node001",
        "Id": "node001",
        "URL": "https://node001-bmc.example.com",
        "SecretName": "node001"
      }
    },
    {
      "Type": "ComputerSystem",
      "Model": "DGX_GB200",
      "Name": "node002",
      "Policy": "Node-High",
      "Redfish": {
        "@odata.type": "#ComputerSystem.v1_23_0.ComputerSystem",
        "@odata.id": "/node002",
        "Id": "node002",
        "URL": "https://node002-bmc.example.com",
        "SecretName": "node002"
      }
    },
    {
      "Type": "ComputerSystem",
      "Model": "DGX_GB200",
      "Name": "node003",
      "Policy": "Node-High",
      "Redfish": {
        "@odata.type": "#ComputerSystem.v1_23_0.ComputerSystem",
        "@odata.id": "/node003",
        "Id": "node003",
        "URL": "https://node003-bmc.example.com",
        "SecretName": "node003"
      }
    }
  ],
  "Topology": {
    "Name": "datacenter",
    "Entities": [
      {
        "Name": "PD-A",
        "Children": ["node001", "node002", "node003"]
      },
      {
        "Name": "node001"
      },
      {
        "Name": "node002"
      },
      {
        "Name": "node003"
      }
    ]
  }
}
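The manual edit above touches three places in the file, which is easy to get wrong by hand. A minimal Python sketch of the same edit follows, assuming the exported file has exactly the structure shown in the example. The function `add_node` and its parameters are hypothetical helpers, not part of dpsctl.

```python
import copy


def add_node(topology: dict, name: str, parent: str, bmc_url: str,
             model: str = "DGX_GB200", policy: str = "Node-High") -> dict:
    """Return a copy of topology with a new ComputerSystem placed under parent."""
    result = copy.deepcopy(topology)  # leave the caller's data untouched
    if any(e.get("Name") == name for e in result["Entities"]):
        raise ValueError(f"entity {name!r} already exists")

    placement = result["Topology"]["Entities"]
    parent_entry = next((e for e in placement if e.get("Name") == parent), None)
    if parent_entry is None:
        raise ValueError(f"parent {parent!r} not found in topology")

    # 1. Entity definition, mirroring the node entries in the example.
    result["Entities"].append({
        "Type": "ComputerSystem",
        "Model": model,
        "Name": name,
        "Policy": policy,
        "Redfish": {
            "@odata.type": "#ComputerSystem.v1_23_0.ComputerSystem",
            "@odata.id": f"/{name}",
            "Id": name,
            "URL": bmc_url,
            "SecretName": name,  # assumes the secret is named after the node
        },
    })
    # 2. Child of the parent power domain, 3. its own topology entry.
    parent_entry.setdefault("Children", []).append(name)
    placement.append({"Name": name})
    return result


# Trimmed-down stand-in for the exported file.
current = {
    "Entities": [
        {"Type": "PowerDomain", "Name": "PD-A"},
        {"Type": "ComputerSystem", "Name": "node001"},
    ],
    "Topology": {
        "Name": "datacenter",
        "Entities": [
            {"Name": "PD-A", "Children": ["node001"]},
            {"Name": "node001"},
        ],
    },
}

updated = add_node(current, "node003", "PD-A", "https://node003-bmc.example.com")
print(updated["Topology"]["Entities"][0]["Children"])  # ['node001', 'node003']
```

In practice, `current` would be loaded from current-topology.json with `json.load`, and `updated` written to updated-topology.json with `json.dump`.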

Option B: Create New Topology File

Alternatively, write updated-topology.json from scratch, using the same structure as the file in Option A and listing every existing node as well as the new one.
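For a flat layout with one power domain and several identical nodes, the whole file can be generated from a list of node names. This is only a sketch under assumptions: the structure is copied from the Option A example, the BMC URL pattern `https://<name>-bmc.example.com` and the secret-per-node naming are taken from that example, and `build_topology` is a hypothetical helper.

```python
import json


def build_topology(topology_name, domain_name, domain_watts, node_names,
                   power_factor=0.9, model="DGX_GB200", policy="Node-High",
                   url_template="https://{name}-bmc.example.com"):
    """Build a single-domain topology dict in the Option A file layout."""
    entities = [{
        "Type": "PowerDomain",
        "Name": domain_name,
        "Constraints": {
            "PowerValue": {"Value": domain_watts, "Type": "W"},
            "PowerFactor": power_factor,
        },
    }]
    for name in node_names:
        entities.append({
            "Type": "ComputerSystem",
            "Model": model,
            "Name": name,
            "Policy": policy,
            "Redfish": {
                "@odata.type": "#ComputerSystem.v1_23_0.ComputerSystem",
                "@odata.id": f"/{name}",
                "Id": name,
                "URL": url_template.format(name=name),
                "SecretName": name,
            },
        })
    # The domain lists all nodes as children; each node also gets an entry.
    placement = [{"Name": domain_name, "Children": list(node_names)}]
    placement += [{"Name": name} for name in node_names]
    return {
        "Entities": entities,
        "Topology": {"Name": topology_name, "Entities": placement},
    }


topology = build_topology("datacenter", "PD-A", 1150000,
                          ["node001", "node002", "node003"])
print(json.dumps(topology, indent=2))
```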

Step 5: Validate the Updated Topology

Validate the topology file before importing:

# Validate the updated topology
dpsctl topology validate --filename updated-topology.json

Expected Output:

{
  "status": {
    "ok": true,
    "diag_msg": "Topology validation passed"
  }
}

If validation fails, fix the errors before proceeding.
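Some mistakes from hand-editing can be caught locally before running the validator. The sketch below checks only the internal consistency of the file. The rules that `dpsctl topology validate` actually enforces are not listed in this guide, so the checks here are assumptions inferred from the example file, and `find_topology_problems` is a hypothetical helper. It complements the validate command and does not replace it.

```python
def find_topology_problems(topology: dict) -> list:
    """Return human-readable consistency problems (empty list if none)."""
    problems = []
    entities = topology.get("Entities", [])
    defined = [e.get("Name") for e in entities]

    # Each entity name should be defined once.
    for name in sorted({n for n in defined if defined.count(n) > 1}, key=str):
        problems.append(f"entity {name!r} is defined more than once")

    placed = topology.get("Topology", {}).get("Entities", [])
    placed_names = {e.get("Name") for e in placed}
    for entry in placed:
        # Every topology entry should refer to a defined entity.
        if entry.get("Name") not in defined:
            problems.append(f"topology entry {entry.get('Name')!r} has no entity definition")
        # Every child should have its own topology entry.
        for child in entry.get("Children", []):
            if child not in placed_names:
                problems.append(
                    f"child {child!r} of {entry.get('Name')!r} has no topology entry")

    # Every node should name a BMC credentials secret.
    for e in entities:
        if e.get("Type") == "ComputerSystem" and not e.get("Redfish", {}).get("SecretName"):
            problems.append(f"node {e.get('Name')!r} has no Redfish.SecretName")
    return problems


# node003 was added to Children but nowhere else.
sample = {
    "Entities": [
        {"Type": "PowerDomain", "Name": "PD-A"},
        {"Type": "ComputerSystem", "Name": "node001",
         "Redfish": {"SecretName": "node001"}},
    ],
    "Topology": {
        "Name": "datacenter",
        "Entities": [
            {"Name": "PD-A", "Children": ["node001", "node003"]},
            {"Name": "node001"},
        ],
    },
}

for problem in find_topology_problems(sample):
    print(problem)
```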

Step 6: Update the Topology

Update the existing topology with the new configuration:

# Update the topology with new nodes
dpsctl topology update --filename updated-topology.json

Expected Output:

{
  "status": {
    "ok": true,
    "diag_msg": "Topology updated successfully"
  }
}

Step 7: Verify the Update

Verify that the new node has been added to the topology:

# List all entities to confirm the new node is included
dpsctl topology list-entities

# Check topology details
dpsctl topology list

Expected Output:

{
  "topologies": [
    {
      "topology_name": "datacenter",
      "is_active": true,
      "leaf_node_names": ["node001", "node002", "node003"]
    }
  ]
}
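To confirm that the update added exactly the intended nodes and dropped none, the `leaf_node_names` lists from Step 1 and from this step can be compared. A small sketch, with `leaf_node_changes` as a hypothetical helper name:

```python
def leaf_node_changes(before, after):
    """Return (added, removed) node names between two leaf_node_names lists."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    return added, removed


# Lists taken from the example outputs in Step 1 and Step 7.
before = ["node001", "node002"]
after = ["node001", "node002", "node003"]

added, removed = leaf_node_changes(before, after)
print("added:", added)      # added: ['node003']
print("removed:", removed)  # removed: []
```

A non-empty `removed` list after an update that was only meant to add nodes suggests that an existing node was left out of updated-topology.json.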

Step 8: Test Node Connectivity

Test connectivity to the new node’s BMC:

# Test connectivity to the new node
dpsctl check connection --topology datacenter --nodes node003

Expected Output:

{
  "total_nodes": 1,
  "success_nodes": 1
}

You can also test connectivity to all nodes in the topology:

# Test connectivity to all nodes
dpsctl check connection --topology datacenter

Expected Output:

{
  "total_nodes": 3,
  "success_nodes": 3
}

Step 9: Reactivate the Topology (if needed)

If the topology was deactivated during the update, reactivate it:

# Reactivate the topology
dpsctl topology activate --topology datacenter --ping-hosts

Flags:

  • --ping-hosts: Test connectivity of all hosts (recommended)
  • --at-least-percent-hosts: Minimum percentage of hosts that must be reachable (default: 50)
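The percentage threshold can be estimated ahead of time from the `total_nodes` and `success_nodes` counts reported by `dpsctl check connection` in Step 8. The sketch below shows the arithmetic only. Whether DPS treats the boundary as inclusive (reachable percentage greater than or equal to the threshold) is an assumption here, and `meets_reachability_threshold` is a hypothetical helper.

```python
def meets_reachability_threshold(total_nodes: int, success_nodes: int,
                                 at_least_percent: float = 50.0) -> bool:
    """Return True if the reachable share of hosts is at or above the threshold."""
    if total_nodes <= 0:
        return False  # nothing to activate against
    reachable_percent = 100.0 * success_nodes / total_nodes
    return reachable_percent >= at_least_percent


print(meets_reachability_threshold(3, 3))        # True: 100% reachable
print(meets_reachability_threshold(3, 1))        # False: about 33% reachable
print(meets_reachability_threshold(3, 2, 75.0))  # False: about 67% is below 75%
```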

Step 10: Verify Power Management

Verify that the new node is properly integrated into power management:

# Check node status
dpsctl check status

# Verify power policies are applied
dpsctl policy list

Troubleshooting

Common Issues and Solutions

1. BMC Connection Failures

Symptoms:

  • Node connection check fails
  • Power policy application errors

Solutions:

  • Verify BMC credentials are correct
  • Check network connectivity to BMC
  • Ensure BMC is accessible from DPS server
  • Verify BMC Redfish API is enabled

# Test BMC connectivity manually
curl -k -u admin:password https://node003-bmc.example.com/redfish/v1

2. Device Specification Not Found

Symptoms:

  • Validation errors about unknown device type/model

Solutions:

  • Import the required device specifications
  • Verify the device type and model match existing specifications

# Import device specifications
dpsctl device upsert devices.yaml

3. Secret Not Found

Symptoms:

  • Authentication errors when connecting to BMC

Solutions:

  • Verify the secret name matches the SecretName in the topology
  • Check that the secret is in the correct namespace
  • Ensure the secret contains the correct bmc key

# Verify secret exists
kubectl get secret node003 -n dps

# Check secret contents (be careful with sensitive data)
kubectl get secret node003 -n dps -o jsonpath='{.data.bmc}' | base64 -d
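Rather than printing the decoded credentials to the terminal, the value of `.data.bmc` can be checked for the expected fields without displaying them. A minimal sketch, assuming the secret holds the JSON object with `username` and `password` shown in Step 2; `check_bmc_secret` is a hypothetical helper.

```python
import base64
import json


def check_bmc_secret(encoded: str) -> list:
    """Check a base64-encoded bmc value; return problems without echoing secrets."""
    try:
        payload = json.loads(base64.b64decode(encoded).decode("utf-8"))
    except ValueError:
        # Covers invalid base64, invalid UTF-8, and invalid JSON.
        return ["secret value is not base64-encoded JSON"]
    if not isinstance(payload, dict):
        return ["secret value is not a JSON object"]
    return [f"missing or empty field: {field}"
            for field in ("username", "password")
            if not payload.get(field)]


# Stand-in for the output of the jsonpath query (before `base64 -d`).
good = base64.b64encode(
    json.dumps({"username": "admin", "password": "your-secure-password"}).encode()
).decode()
bad = base64.b64encode(b'{"username": "admin"}').decode()

print(check_bmc_secret(good))  # []
print(check_bmc_secret(bad))   # ['missing or empty field: password']
```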

4. Topology Update Conflicts

Symptoms:

  • Update fails with hash mismatch errors

Solutions:

  • Use --force flag to bypass hash validation (use with caution)
  • Export the current topology and merge changes manually
  • Ensure no other changes were made to the topology after you exported it

# Force update (use with caution)
dpsctl topology update --filename updated-topology.json --force

5. Power Policy Application Issues

Symptoms:

  • Nodes not responding to power policy changes
  • Power limit errors

Solutions:

  • Verify the node supports the specified power policy plugin
  • Check that the BMC supports the required Redfish endpoints
  • Ensure the device specification includes the correct powerPolicyPlugin