Managing Resource Groups

Overview

This guide provides step-by-step instructions for managing resource groups in DPS.

For more information, see Resource Groups.

Note: Resource groups are typically not managed directly. They are expected to be managed by HPC job schedulers such as SLURM. This guide demonstrates the raw resource group management workflow. See Integrating with SLURM.

Prerequisites

  • DPS server running and accessible
  • dpsctl installed and authenticated
  • Active topology with entities configured

Basic Resource Group Workflow

Step 1: Create Resource Group

dpsctl resource-group create \
  --resource-group "ml-training-job" \
  --external-id 12345 \
  --policy "Node-High"

Step 2: Add Hardware Resources

dpsctl resource-group add \
  --resource-group "ml-training-job" \
  --entities "node001,node002,node003"

Step 3: Activate Power Policies

dpsctl resource-group activate \
  --resource-group "ml-training-job"

Step 4: Cleanup

dpsctl resource-group delete \
  --resource-group "ml-training-job"
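
The four steps above can be chained so that the resource group is always deleted once it has been created, even if a later step fails. The following is a minimal sketch rather than an official integration. The function name run_resource_group_workflow and the DPSCTL variable (which allows pointing at a different binary or a stub for dry runs) are illustrative assumptions. Only the dpsctl subcommands and flags shown in Steps 1 to 4 are used.

```shell
#!/usr/bin/env bash
# Sketch: run the create/add/activate/delete workflow for one resource group.
# DPSCTL and the function name are illustrative assumptions, not part of DPS.
DPSCTL="${DPSCTL:-dpsctl}"

# Usage: run_resource_group_workflow GROUP EXTERNAL_ID POLICY ENTITIES [JOB_CMD...]
run_resource_group_workflow() {
  local group="$1" external_id="$2" policy="$3" entities="$4"
  shift 4
  local rc=0

  # Step 1: create the resource group; nothing to clean up if this fails.
  "$DPSCTL" resource-group create \
    --resource-group "$group" \
    --external-id "$external_id" \
    --policy "$policy" || return 1

  # Steps 2 and 3: add entities, then activate. Stop at the first failure.
  "$DPSCTL" resource-group add \
    --resource-group "$group" \
    --entities "$entities" \
  && "$DPSCTL" resource-group activate \
    --resource-group "$group" \
  || rc=1

  # Run the optional job command (remaining arguments) only if setup succeeded.
  if [ "$rc" -eq 0 ] && [ "$#" -gt 0 ]; then
    "$@" || rc=1
  fi

  # Step 4: always delete the group once it has been created.
  "$DPSCTL" resource-group delete --resource-group "$group" || rc=1
  return "$rc"
}
```

For example, run_resource_group_workflow "ml-training-job" 12345 "Node-High" "node001,node002,node003" ./train.sh would run the hypothetical job script ./train.sh between activation and cleanup. The function returns a non-zero status if any step or the job fails.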

Troubleshooting

Verifying Node Status

After activating a resource group, always verify that each individual node’s policy was applied successfully:

  1. Check that each node in node_statuses has "ok": true
  2. Review any error messages in diag_msg fields
  3. If any nodes show "ok": false, investigate the specific error before proceeding

Example of a failed node:

"node004": {
  "status": {
    "ok": false,
    "diag_msg": "BMC connection timeout"
  }
}
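
As a quick way to apply the checklist above, the status output can be scanned for failed nodes. The sketch below assumes the JSON containing node_statuses has been saved or piped from wherever it was obtained. The function names are hypothetical. It uses plain text matching rather than a JSON parser, so it does not handle escaped quotes inside messages; a JSON-aware tool would be more robust.

```shell
#!/usr/bin/env bash
# Sketch: inspect node status JSON read from stdin using text matching only.
# Function names are illustrative; this is not a JSON parser.

# Print the number of nodes reporting "ok": false.
count_failed_nodes() {
  { grep -Eo '"ok"[[:space:]]*:[[:space:]]*false' || true; } \
    | wc -l | tr -d '[:space:]'
}

# Print every non-empty diag_msg value, one per line.
list_diag_messages() {
  { grep -Eo '"diag_msg"[[:space:]]*:[[:space:]]*"[^"]*"' || true; } \
    | sed -E -e 's/^"diag_msg"[[:space:]]*:[[:space:]]*"(.*)"$/\1/' -e '/^$/d'
}
```

With the status output saved to a hypothetical file status.json, count_failed_nodes < status.json prints the number of failed nodes and list_diag_messages < status.json prints the messages to investigate before proceeding.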

PRS

Power Reservation Steering (PRS) is an optional product, directly integrated with DPS, that adjusts power allocations for resource groups in real time. PRS consumes telemetry provided by DPS to generate per-GPU power allocation recommendations, which DPS validates against the PDN constraints before applying. PRS can be controlled in a few ways via DPS:

  • Deployed or not deployed (configured via Helm chart values option .Values.prs.enabled)
  • Toggled via global configuration in the DPS WebUI
  • Toggled per ResourceGroup during creation with option --prs-enabled=false

It is recommended to deploy PRS and leave it enabled globally. For a Max-P Resource Group, we recommend disabling PRS; leave it enabled for all other resource groups.
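
To illustrate the per-group toggle, the sketch below creates a resource group with PRS disabled by appending --prs-enabled=false to the create command from Step 1. The function name and the DPSCTL variable are illustrative assumptions. The policy name is passed in by the caller because the policy used for a Max-P group depends on your configuration.

```shell
#!/usr/bin/env bash
# Sketch: create a resource group with PRS disabled for that group only.
# DPSCTL and the function name are illustrative assumptions, not part of DPS.
DPSCTL="${DPSCTL:-dpsctl}"

# Usage: create_resource_group_without_prs GROUP EXTERNAL_ID POLICY
create_resource_group_without_prs() {
  local group="$1" external_id="$2" policy="$3"
  "$DPSCTL" resource-group create \
    --resource-group "$group" \
    --external-id "$external_id" \
    --policy "$policy" \
    --prs-enabled=false
}
```

A call such as create_resource_group_without_prs "max-p-job" 12346 "<policy-name>" uses a hypothetical group name and external ID; substitute the policy name configured for your Max-P workloads.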

Further Reading