Managing Resource Groups
Overview
This guide provides step-by-step instructions for managing resource groups in DPS.
For more information, see Resource Groups.
Note: Resource groups are typically managed by a workload scheduler. This guide demonstrates how to manage them directly.
Prerequisites
- DPS server running and accessible
- `dpsctl` installed and authenticated
- Active topology with entities configured
Basic Resource Group Workflow
Step 1: Create Resource Group
```shell
dpsctl resource-group create \
  --resource-group "ml-training-job" \
  --external-id 12345 \
  --policy "Node-High"
```

Step 2: Add Hardware Resources
```shell
dpsctl resource-group add \
  --resource-group "ml-training-job" \
  --entities "node001,node002,node003"
```

Step 3: (Optional) Configure Per-GPU Power Policies
If your workload requires fine-grained GPU power control, set per-GPU power limits before or after activation:
```shell
dpsctl resource-group update \
  --resource-group "ml-training-job" \
  --entity-gpu-policy node001=500,550,600,700,650,700,550,600
```

Step 4: Activate Power Policies
```shell
dpsctl resource-group activate \
  --resource-group "ml-training-job"
```

Step 5: Cleanup
```shell
dpsctl resource-group delete \
  --resource-group "ml-training-job"
```

Dynamic Resource Management
Resources can be added to or removed from resource groups at any time, including after activation. This enables dynamic workload scaling without full resource group deactivation.
Note: Add/remove operations are rejected while a resource group is activating or deactivating. Wait for the operation to complete before modifying resources.
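The "wait for the operation to complete" guidance above can be scripted as a small polling helper. This is an illustrative sketch only, not part of `dpsctl`: `get_group_state` is a hypothetical callable standing in for however your tooling queries the resource group's current state.

```python
import time

# States during which add/remove operations are rejected (per the note above).
TRANSITIONAL_STATES = {"activating", "deactivating"}

def wait_until_stable(get_group_state, timeout_s=300, poll_interval_s=5):
    """Poll a resource group's state until it leaves a transitional state.

    `get_group_state` is a hypothetical callable returning the group's
    current state as a string. Returns the final state, or raises
    TimeoutError if the group does not stabilize in time.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_group_state()
        if state not in TRANSITIONAL_STATES:
            return state
        time.sleep(poll_interval_s)
    raise TimeoutError("resource group did not stabilize in time")
```

Calling this before issuing an add or remove avoids a rejected operation during activation or deactivation.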
Adding Resources to an Active Resource Group
When adding resources to an active resource group, policies are applied immediately and power is reallocated as needed:
```shell
dpsctl resource-group add \
  --resource-group "ml-training-job" \
  --entities "node004,node005"
```

Strict Policy Enforcement
Use `--strict-policy` to ensure the requested policy is applied exactly. If power constraints prevent the policy from being satisfied, the operation fails rather than automatically downgrading:

```shell
dpsctl resource-group add \
  --resource-group "ml-training-job" \
  --entities "node004" \
  --policy "Node-High" \
  --strict-policy
```

Controlling Power Reprovisioning
By default, power may be redistributed from other resource groups if needed (power stealing). Use `--allow-reprovision=false` to prevent this:

```shell
dpsctl resource-group add \
  --resource-group "ml-training-job" \
  --entities "node004" \
  --allow-reprovision=false
```

Removing Resources from an Active Resource Group
When removing resources from an active resource group, policies on the removed entities are reverted to topology defaults and power is reallocated:
```shell
dpsctl resource-group remove \
  --resource-group "ml-training-job" \
  --entities "node004,node005"
```

Troubleshooting
Verifying Node Status
After activating a resource group, always verify that each individual node’s policy was applied successfully:
- Check that each node in `node_statuses` has `"ok": true`
- Review any error messages in `diag_msg` fields
- If any nodes show `"ok": false`, investigate the specific error before proceeding
Example of a failed node:
"node004": {
"status": {
"ok": false,
"diag_msg": "BMC connection timeout"
}
}PRS
Power Reservation Steering (PRS) is an optional product, directly integrated with DPS, that performs real-time power allocation adjustment for resource groups. PRS consumes telemetry provided by DPS to generate per-GPU power allocation recommendations, which DPS validates against PDN constraints before applying them. PRS can be controlled in several ways via DPS:
- Not deployed (configured via the Helm chart value `.Values.prs.enabled`)
- Toggled via global configuration in the DPS WebUI
- Toggled per ResourceGroup during creation with the `--prs-enabled=false` option

We recommend deploying PRS and leaving it enabled globally, with one exception: disable PRS for Max-P resource groups and leave it on for all others.
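For reference, the Helm deployment toggle maps to a values-file entry like the following. This is a sketch based only on the `.Values.prs.enabled` option named above; the surrounding file layout is assumed:

```yaml
# values.yaml (fragment; layout assumed from the .Values.prs.enabled option)
prs:
  enabled: true   # set to false to skip deploying PRS
```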
Further Reading
- Resource Group Concepts - Understanding resource groups
- Resource Group CLI Commands - Detailed command reference
- Power Policies - Understanding power policy hierarchy