Resource Groups
Overview
Resource groups let you dynamically group topology entities and override their default power policies, enabling workload-specific power management.
Resource Group Structure
A resource group consists of:
- Workload Information - External job/workload identification and metadata
- Hardware Resources - Compute nodes, GPUs, and other power-managed entities
- Power Policies - Specific power configurations applied to the hardware resources
- Lifecycle State - Whether the group is currently active or inactive
Resource Group Lifecycle
Resource groups follow a flexible lifecycle that supports both static and dynamic resource management:
- CREATE - Create an empty, inactive resource group with optional default power policy
- ADD/REMOVE - Add or remove compute resources (nodes); for active groups, policies are applied or reverted immediately and power is reallocated
- ACTIVATE - Apply power policies to hardware and mark the group as active
- UPDATE (Optional) - Dynamically adjust power policies during workload execution
- DELETE - Deactivate and clean up the group, restoring topology defaults
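The lifecycle above can be sketched as a small state model. This is an illustrative Python sketch of the described semantics, not the dpsctl implementation; the class and attribute names are assumptions.

```python
# Illustrative model of the resource-group lifecycle described above.
# Names (ResourceGroup, applied, etc.) are hypothetical, not a real dpsctl API.

class ResourceGroup:
    def __init__(self, name, default_policy=None):
        # CREATE: an empty, inactive group with an optional default policy
        self.name = name
        self.default_policy = default_policy
        self.nodes = set()
        self.active = False
        self.applied = {}  # node -> policy currently applied to hardware

    def add(self, node):
        # ADD: database-only while inactive; applied immediately when active
        self.nodes.add(node)
        if self.active:
            self.applied[node] = self.default_policy

    def remove(self, node):
        # REMOVE: for active groups, the node reverts to topology defaults
        self.nodes.discard(node)
        self.applied.pop(node, None)

    def activate(self):
        # ACTIVATE: apply policies to all member hardware, mark group active
        for node in self.nodes:
            self.applied[node] = self.default_policy
        self.active = True

    def update(self, policy):
        # UPDATE: adjust policies dynamically during workload execution
        self.default_policy = policy
        if self.active:
            for node in self.nodes:
                self.applied[node] = policy

    def delete(self):
        # DELETE: deactivate and clean up, restoring topology defaults
        self.applied.clear()
        self.active = False
        self.nodes.clear()


rg = ResourceGroup("ml-train", default_policy="Node-High")
rg.add("node01")        # stored only; the group is still inactive
rg.activate()           # Node-High is now applied to node01
rg.add("node02")        # applied immediately because the group is active
rg.update("Node-Med")   # live policy adjustment on both nodes
rg.delete()             # both nodes revert to topology defaults
```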
Dynamic Resource Management
Resources can be added to or removed from resource groups at any time, regardless of activation state:
- Inactive resource groups: Add/remove operations are database-only. Policies are stored but not applied until activation.
- Active resource groups: Add/remove operations take effect immediately. When adding resources, policies are applied to hardware and power is reallocated. When removing resources, policies are reverted to topology defaults.
- Activating/Deactivating resource groups: Add/remove operations are rejected while the resource group is transitioning. Wait for the activation or deactivation to complete before modifying resources.
This flexibility enables dynamic workload scaling and resource reallocation without requiring full resource group deactivation.
Power Policy Hierarchy
Resource groups use a multi-level policy system:
- Topology Default - Base policy for all hardware (e.g., Node-Med)
- Resource Group Default - Workload-specific override (e.g., Node-High for ML training)
- Entity-Specific - Granular control per hardware component (e.g., GPU-Optimized for specific nodes)
- Per-GPU Policies - Individual GPU power limits within a node (e.g., 500W for GPU0, 700W for GPU3)
Each level can override the previous one, allowing precise power management from datacenter-wide defaults down to individual GPUs.
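The override order can be expressed as a small resolution function: the most specific configured level wins. This is a minimal sketch of the hierarchy described above; the function and parameter names are assumptions, not a real dpsctl data model.

```python
# Sketch of the four-level policy hierarchy: per-GPU > entity-specific >
# resource-group default > topology default. Names are illustrative.

def resolve_policy(topology_default, group_default=None,
                   entity_policy=None, gpu_policy=None):
    """Return the effective policy; the most specific non-None level wins."""
    for policy in (gpu_policy, entity_policy, group_default, topology_default):
        if policy is not None:
            return policy


# Datacenter-wide default only:
resolve_policy("Node-Med")                             # -> "Node-Med"
# Workload-specific override for an ML training group:
resolve_policy("Node-Med", group_default="Node-High")  # -> "Node-High"
# Per-GPU limit overrides everything else:
resolve_policy("Node-Med", group_default="Node-High",
               entity_policy="GPU-Optimized", gpu_policy="500W")  # -> "500W"
```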
Per-GPU Power Policies
In addition to node-level policies, resource groups support per-GPU power limits. This allows workloads to allocate a different power budget to each individual GPU based on that GPU's workload characteristics.
Per-GPU policies can be configured using dpsctl resource-group update --entity-gpu-policy or the standalone dpsctl gpu-policy command. See Resource Group Update and GPU Policy for details.
Further Reading
- Power Policies - Define resource group power configurations
- Entities - Hardware resources managed by resource groups
- Topologies - Infrastructure that provides baseline policies
- User Accounts - Authentication for automation integration
- Resource Group Management - Manual resource group operations