Managing Resource Groups

Overview

This guide provides step-by-step instructions for managing resource groups in DPS.

For more information, see Resource Groups.

Note: Resource groups are typically managed by a workload scheduler. This guide demonstrates how to manage them directly.

Prerequisites

  • DPS server running and accessible
  • dpsctl installed and authenticated
  • Active topology with entities configured

Basic Resource Group Workflow

Step 1: Create Resource Group

dpsctl resource-group create \
  --resource-group "ml-training-job" \
  --external-id 12345 \
  --policy "Node-High"

Step 2: Add Hardware Resources

dpsctl resource-group add \
  --resource-group "ml-training-job" \
  --entities "node001,node002,node003"

Step 3: (Optional) Configure Per-GPU Power Policies

If your workload requires fine-grained GPU power control, set per-GPU power limits before or after activation:

dpsctl resource-group update \
  --resource-group "ml-training-job" \
  --entity-gpu-policy node001=500,550,600,700,650,700,550,600
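
The per-GPU value format is not spelled out here; a reasonable reading, assumed in the sketch below, is <node>=<watts for GPU 0>,<watts for GPU 1>,..., ordered by GPU index, so the example above sets eight limits for an eight-GPU node. Verify the exact syntax with dpsctl resource-group update --help.

# Assumed format: <node>=<comma-separated per-GPU power limits in watts>,
# ordered by GPU index. This sketch would cap GPU 0 at 450 W and GPUs 1-3
# at 600 W on a hypothetical four-GPU node002.
dpsctl resource-group update \
  --resource-group "ml-training-job" \
  --entity-gpu-policy node002=450,600,600,600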

Step 4: Activate Power Policies

dpsctl resource-group activate \
  --resource-group "ml-training-job"
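
Activation is the point where per-node failures surface, so a read-back is worth scripting (see Troubleshooting below). The command here is only a sketch: the describe subcommand and --output json flag are assumptions; check dpsctl resource-group --help for the actual read command in your version.

# Hypothetical read-back of activation results; "describe" and
# "--output json" are assumptions, not confirmed dpsctl options.
dpsctl resource-group describe \
  --resource-group "ml-training-job" \
  --output json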

Step 5: Cleanup

dpsctl resource-group delete \
  --resource-group "ml-training-job"

Dynamic Resource Management

Resources can be added to or removed from resource groups at any time, including after activation. This enables dynamic workload scaling without full resource group deactivation.

Note: Add/remove operations are rejected while a resource group is activating or deactivating. Wait for the operation to complete before modifying resources.
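
If add/remove operations are scripted, it helps to wait out an in-flight activation or deactivation first. The sketch below polls a hypothetical describe subcommand for a .state field with values like ACTIVATING and DEACTIVATING; substitute the real command and field names from your dpsctl version. The add and remove commands themselves appear in the sections below.

# Sketch: poll until the group leaves a transitional state before
# modifying it. "describe", "--output json", and ".state" are assumptions.
while dpsctl resource-group describe \
    --resource-group "ml-training-job" --output json \
    | jq -e '.state == "ACTIVATING" or .state == "DEACTIVATING"' >/dev/null; do
  sleep 5
done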

Adding Resources to an Active Resource Group

When adding resources to an active resource group, policies are applied immediately and power is reallocated as needed:

dpsctl resource-group add \
  --resource-group "ml-training-job" \
  --entities "node004,node005"

Strict Policy Enforcement

Use --strict-policy to ensure the requested policy is applied exactly. If power constraints prevent the policy from being satisfied, the operation fails rather than automatically downgrading:

dpsctl resource-group add \
  --resource-group "ml-training-job" \
  --entities "node004" \
  --policy "Node-High" \
  --strict-policy
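
Because a strict add fails outright instead of downgrading, scripts can catch the failure and decide what to do next. This sketch assumes dpsctl returns a nonzero exit code on failure, which is conventional for CLIs but not confirmed here:

# Sketch: fall back to a best-effort add when the strict add is rejected
# (assumes dpsctl exits nonzero on failure).
if ! dpsctl resource-group add \
    --resource-group "ml-training-job" \
    --entities "node004" \
    --policy "Node-High" \
    --strict-policy; then
  echo "Node-High not satisfiable for node004; retrying without --strict-policy"
  dpsctl resource-group add \
    --resource-group "ml-training-job" \
    --entities "node004" \
    --policy "Node-High"
fi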

Controlling Power Reprovisioning

By default, power may be redistributed from other resource groups if needed (power stealing). Use --allow-reprovision=false to prevent this:

dpsctl resource-group add \
  --resource-group "ml-training-job" \
  --entities "node004" \
  --allow-reprovision=false
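
The two flags compose (assuming no restriction on combining them): to add capacity only when it can receive exactly the requested policy without pulling power from other groups, pass both:

# Add node004 only if Node-High can be satisfied from unallocated power;
# fail instead of downgrading the policy or stealing from other groups.
dpsctl resource-group add \
  --resource-group "ml-training-job" \
  --entities "node004" \
  --policy "Node-High" \
  --strict-policy \
  --allow-reprovision=false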

Removing Resources from an Active Resource Group

When removing resources from an active resource group, policies on the removed entities are reverted to topology defaults and power is reallocated:

dpsctl resource-group remove \
  --resource-group "ml-training-job" \
  --entities "node004,node005"

Troubleshooting

Verifying Node Status

After activating a resource group, always verify that each individual node’s policy was applied successfully:

  1. Check that each node in node_statuses has "ok": true
  2. Review any error messages in diag_msg fields
  3. If any nodes show "ok": false, investigate the specific error before proceeding

Example of a failed node:

"node004": {
  "status": {
    "ok": false,
    "diag_msg": "BMC connection timeout"
  }
}
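
To surface only the failing nodes, the status JSON can be filtered with jq. This sketch reuses the hypothetical describe read-back from Step 4 and assumes each node is nested under node_statuses in the shape shown above:

# Sketch: list nodes with "ok": false together with their diag_msg.
# "describe" and "--output json" are assumptions; the jq filter matches
# the node_statuses shape shown above.
dpsctl resource-group describe \
  --resource-group "ml-training-job" --output json \
  | jq '.node_statuses | to_entries[]
        | select(.value.status.ok | not)
        | {node: .key, diag: .value.status.diag_msg}'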

Power Reservation Steering (PRS)

Power Reservation Steering (PRS) is an optional product, directly integrated with DPS, that performs real-time power allocation adjustments for resource groups. PRS consumes telemetry provided by DPS to generate per-GPU power allocation recommendations, which DPS validates against PDN constraints before applying. PRS can be controlled via DPS in a few ways:

  • Not deployed at all (configured via the Helm chart value .Values.prs.enabled)
  • Toggled globally via the DPS WebUI configuration
  • Toggled per resource group at creation time with --prs-enabled=false

We recommend deploying PRS and leaving it enabled globally. For Max-P resource groups, we recommend disabling PRS; leave it enabled for all other groups.
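
For example, a Max-P group that opts out of PRS at creation time might look like the following; the policy name "Max-P" and the external ID are illustrative assumptions:

# Sketch: create a Max-P group with PRS disabled, per the recommendation
# above ("Max-P" as the --policy value is an assumption).
dpsctl resource-group create \
  --resource-group "max-p-inference" \
  --external-id 67890 \
  --policy "Max-P" \
  --prs-enabled=false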

Further Reading