Resource Groups
Overview
Resource groups allow workload schedulers, such as SLURM, to dynamically group topology entities and override their default power policies. This enables workload-specific power management and serves as the primary interface between workload schedulers and the DPS power management system.
Resource Group Structure
A resource group consists of:
- Workload Information - External job/workload identification and metadata
- Hardware Resources - Compute nodes, GPUs, and other power-managed entities
- Power Policies - Specific power configurations applied to the hardware resources
- Lifecycle State - Whether the group is currently active or inactive
Power Policy Hierarchy
Resource groups use a three-level policy system:
- Topology Default - Base policy for all hardware (e.g., Node-Med)
- Resource Group Default - Workload-specific override (e.g., Node-High for ML training)
- Entity-Specific - Granular control per hardware component (e.g., GPU-Optimized for specific nodes)
Each level can override the previous one, allowing precise power management from datacenter-wide defaults down to individual components.
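For example, on a cluster whose topology default is Node-Med, a resource group can raise the policy for an ML training job. The sketch below reuses the documented dpsctl resource-group create command; the commented-out entity-level command uses a hypothetical --policy flag on the add subcommand purely to indicate where an entity-specific override such as GPU-Optimized would apply.
# Topology default (e.g., Node-Med) applies to all hardware unless overridden.
# Resource group default: the ML training job's nodes run at Node-High.
dpsctl resource-group create \
--resource-group "ml-training" \
--external-id 12345 \
--policy "Node-High"
# Entity-specific override (hypothetical flag, shown for illustration only):
# pin selected GPU nodes within the group to a GPU-Optimized policy.
# dpsctl resource-group add \
#   --resource-group "ml-training" \
#   --entities "gpu-node-01" \
#   --policy "GPU-Optimized"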
Integration with Job Schedulers
Resource groups are typically created automatically through integration with HPC job schedulers like SLURM.
SLURM integration uses prolog and epilog scripts that execute before and after job execution.
SLURM Integration Example
SLURM integration is configured in the slurm.conf file using the PrologSlurmctld and EpilogSlurmctld parameters, as shown in the following example:
# slurm.conf
PrologSlurmctld=/usr/share/dps/prolog.sh
EpilogSlurmctld=/usr/share/dps/epilog.sh
Prolog Script (Job Start)
The prolog script runs before job execution and:
- Creates a resource group with the job ID as the external identifier
- Adds allocated compute nodes to the resource group
- Activates power policies to apply them to hardware
- Configures workload-specific settings (policies, PRS, DPM, etc.)
#!/bin/bash
# SLURM Prolog - Create and activate resource group
# Extract job information
JOB_NAME=${SLURM_JOB_NAME}
JOB_ID=${SLURM_JOB_ID}
NODES=${SLURM_JOB_NODELIST}
# Convert SLURM nodelist to comma-separated format
NODE_LIST=$(scontrol show hostname ${NODES} | tr '\n' ',' | sed 's/,$//')
# Create resource group
dpsctl resource-group create \
--resource-group "${JOB_NAME}" \
--external-id ${JOB_ID} \
--policy "Node-High"
# Add allocated nodes
dpsctl resource-group add \
--resource-group "${JOB_NAME}" \
--entities "${NODE_LIST}"
# Activate resource group
dpsctl resource-group activate \
--resource-group "${JOB_NAME}"Epilog Script (Job End)
The epilog script runs after job completion and:
- Deletes the resource group
Deleting the resource group returns the devices to the original power configuration defined in the topology.
#!/bin/bash
# SLURM Epilog - Clean up resource group
JOB_NAME=${SLURM_JOB_NAME}
# Delete resource group (automatically deactivates)
dpsctl resource-group delete \
--resource-group "${JOB_NAME}"Job Comment Integration
Advanced SLURM integration supports parsing DPS settings from job comments:
# Submit job with DPS settings in comment
sbatch --comment="dps_policy:Node-High,dps_prs:false,dps_dpm:true" job_script.sh
The prolog script automatically parses these settings:
- dps_policy:<string> - Sets the power policy
- dps_prs:<bool> - Disables Power Reservation Steering
- dps_dpm:<bool> - Enables Dynamic Power Management
- dps_wpps:<comma-separated ints> - Sets workload profile IDs
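A minimal sketch of how a prolog script could parse these settings is shown below. It assumes the comment can be read with scontrol show job; the exact retrieval mechanism (and whether a variable such as SLURM_JOB_COMMENT is exported to the prolog) depends on the SLURM version and configuration.
#!/bin/bash
# Sketch: parse DPS settings from the SLURM job comment.
COMMENT=$(scontrol show job "${SLURM_JOB_ID}" | grep -o 'Comment=[^ ]*' | cut -d= -f2-)
# Split the comma-separated key:value pairs.
# Note: dps_wpps values also contain commas and may need separate handling.
IFS=',' read -ra SETTINGS <<< "${COMMENT}"
for setting in "${SETTINGS[@]}"; do
    key="${setting%%:*}"
    value="${setting#*:}"
    case "${key}" in
        dps_policy) DPS_POLICY="${value}" ;;
        dps_prs)    DPS_PRS="${value}" ;;
        dps_dpm)    DPS_DPM="${value}" ;;
        dps_wpps)   DPS_WPPS="${value}" ;;
    esac
done
# The parsed policy can then be passed to the documented create command, e.g.:
# dpsctl resource-group create \
# --resource-group "${SLURM_JOB_NAME}" \
# --external-id "${SLURM_JOB_ID}" \
# --policy "${DPS_POLICY:-Node-High}"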
Further Reading
- Power Policies - Define resource group power configurations
- Entities - Hardware resources managed by resource groups
- Topologies - Infrastructure that provides baseline policies
- User Accounts - Authentication for automation integration
- SLURM Integration Guide - Detailed SLURM setup and configuration
- Resource Group Management - Manual resource group operations