Managing Resource Groups
Overview
This guide provides step-by-step instructions for managing resource groups in DPS.
For more information, see Resource Groups.
Note: Resource groups are typically not managed directly; they are expected to be managed by HPC job schedulers such as SLURM. This guide demonstrates the raw resource group management workflow. See Integrating with SLURM.
Prerequisites
- DPS server running and accessible
- dpsctl installed and authenticated
- Active topology with entities configured
Basic Resource Group Workflow
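The four steps in this section can also be chained into one script. The following is a minimal sketch rather than a supported tool: it uses only the dpsctl subcommands and flags shown in this guide, while the `run` helper, the variable names, and the `DRY_RUN` switch are illustrative assumptions. By default it only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch: chain the raw resource group steps from this guide.
# Names, IDs, and the DRY_RUN switch are placeholders, not DPS features.
set -eu   # stop at the first failing command or unset variable

RG="${RG:-ml-training-job}"                      # resource group name
EXTERNAL_ID="${EXTERNAL_ID:-12345}"              # external ID, as in Step 1
POLICY="${POLICY:-Node-High}"                    # power policy name
ENTITIES="${ENTITIES:-node001,node002,node003}"  # comma-separated entities
DRY_RUN="${DRY_RUN:-1}"                          # 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

workflow() {
  run dpsctl resource-group create \
    --resource-group "$RG" \
    --external-id "$EXTERNAL_ID" \
    --policy "$POLICY"
  run dpsctl resource-group add \
    --resource-group "$RG" \
    --entities "$ENTITIES"
  run dpsctl resource-group activate \
    --resource-group "$RG"
  # The workload would run here, between activation and cleanup.
  run dpsctl resource-group delete \
    --resource-group "$RG"
}

workflow
```

Setting `DRY_RUN=0` executes the commands against the DPS server. Because of `set -eu`, the script stops at the first command that fails, so a failed create is not followed by add or activate.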
Step 1: Create Resource Group
dpsctl resource-group create \
--resource-group "ml-training-job" \
--external-id 12345 \
--policy "Node-High"
Step 2: Add Hardware Resources
dpsctl resource-group add \
--resource-group "ml-training-job" \
--entities "node001,node002,node003"
Step 3: Activate Power Policies
dpsctl resource-group activate \
--resource-group "ml-training-job"
Step 4: Cleanup
dpsctl resource-group delete \
--resource-group "ml-training-job"
Troubleshooting
Verifying Node Status
After activating a resource group, always verify that each individual node’s policy was applied successfully:
- Check that each node in node_statuses has "ok": true
- Review any error messages in diag_msg fields
- If any nodes show "ok": false, investigate the specific error before proceeding
Example of a failed node:
"node004": {
"status": {
"ok": false,
"diag_msg": "BMC connection timeout"
}
}
PRS
Power Reservation Steering (PRS) is an optional product, directly integrated with DPS, that adjusts power allocations for resource groups in real time. PRS consumes telemetry provided by DPS to generate per-GPU power allocation recommendations, which DPS validates against the PDN constraints before applying. PRS can be controlled through DPS in the following ways:
- Not deployed (configured via the Helm chart values option .Values.prs.enabled)
- Toggled via global configuration in the DPS WebUI
- Toggled per ResourceGroup during creation with the option --prs-enabled=false
It is recommended to deploy PRS and leave it enabled globally. For Max-P resource groups, we recommend disabling PRS for that group and leaving it enabled for all others.
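As an illustration of the per-group toggle, the sketch below adds `--prs-enabled=false` to the create command from Step 1. The group name, external ID, and the policy name `Max-P` are placeholders (substitute the policy name your deployment defines), and the `run` helper with its `DRY_RUN` switch is an assumption added so the command can be previewed without contacting the server.

```shell
#!/bin/sh
# Sketch: create a resource group with PRS disabled for that group only.
set -eu

RG="${RG:-max-p-job}"                # hypothetical resource group name
EXTERNAL_ID="${EXTERNAL_ID:-12346}"  # hypothetical external ID
POLICY="${POLICY:-Max-P}"            # placeholder policy name
DRY_RUN="${DRY_RUN:-1}"              # 1 = print the command only

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

create_without_prs() {
  run dpsctl resource-group create \
    --resource-group "$RG" \
    --external-id "$EXTERNAL_ID" \
    --policy "$POLICY" \
    --prs-enabled=false
}

create_without_prs
```

Groups created without the flag keep following the global PRS setting, which matches the recommendation to leave PRS enabled everywhere else.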
Further Reading
- Resource Group Concepts - Understanding resource groups
- Resource Group CLI Commands - Detailed command reference
- Integrating With SLURM - SLURM and other scheduler integration
- Power Policies - Understanding power policy hierarchy