Resource Groups
Overview
Resource groups let you dynamically group topology entities and override their default power policies, enabling workload-specific power management.
Resource Group Structure
A resource group consists of:
- Workload Information - External job/workload identification and metadata
- Hardware Resources - Compute nodes, GPUs, and other power-managed entities
- Power Policies - Specific power configurations applied to the hardware resources
- Lifecycle State - Whether the group is currently active or inactive
Resource Group Lifecycle
Resource groups follow a flexible lifecycle that supports both static and dynamic resource management:
- CREATE - Create an empty, inactive resource group with optional default power policy
- ADD/REMOVE - Add or remove compute resources (nodes); for active groups, policies are applied or reverted immediately and power is reallocated
- ACTIVATE - Apply power policies to hardware and mark the group as active
- UPDATE (Optional) - Dynamically adjust power policies during workload execution
- DELETE - Deactivate and clean up the group, restoring topology defaults
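The lifecycle above can be sketched as a small state model. This is an illustrative Python sketch of the described semantics, not the dpsctl implementation; the class and attribute names are assumptions.

```python
# Illustrative model of the resource-group lifecycle described above.
# Names (ResourceGroup, applied, etc.) are hypothetical, not a real dpsctl API.

class ResourceGroup:
    def __init__(self, name, default_policy=None):
        # CREATE: an empty, inactive group with an optional default policy
        self.name = name
        self.default_policy = default_policy
        self.nodes = set()
        self.active = False
        self.applied = {}  # node -> policy currently applied to hardware

    def add(self, node):
        # ADD: database-only while inactive; applied immediately when active
        self.nodes.add(node)
        if self.active:
            self.applied[node] = self.default_policy

    def remove(self, node):
        # REMOVE: for active groups, the node reverts to topology defaults
        self.nodes.discard(node)
        self.applied.pop(node, None)

    def activate(self):
        # ACTIVATE: apply policies to all member hardware, mark group active
        for node in self.nodes:
            self.applied[node] = self.default_policy
        self.active = True

    def update(self, policy):
        # UPDATE: adjust policies dynamically during workload execution
        self.default_policy = policy
        if self.active:
            for node in self.nodes:
                self.applied[node] = policy

    def delete(self):
        # DELETE: deactivate and clean up, restoring topology defaults
        self.applied.clear()
        self.active = False
        self.nodes.clear()


rg = ResourceGroup("ml-train", default_policy="Node-High")
rg.add("node01")        # stored only; the group is still inactive
rg.activate()           # Node-High is now applied to node01
rg.add("node02")        # applied immediately because the group is active
rg.update("Node-Med")   # live policy adjustment on both nodes
rg.delete()             # both nodes revert to topology defaults
```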
Dynamic Resource Management
Resources can be added to or removed from resource groups at any time, regardless of activation state:
- Inactive resource groups: Add/remove operations are database-only. Policies are stored but not applied until activation.
- Active resource groups: Add/remove operations take effect immediately. When adding resources, policies are applied to hardware and power is reallocated. When removing resources, policies are reverted to topology defaults.
- Activating/Deactivating resource groups: Add/remove operations are rejected while the resource group is transitioning. Wait for the activation or deactivation to complete before modifying resources.
This flexibility enables dynamic workload scaling and resource reallocation without requiring full resource group deactivation.
Power Policy Hierarchy
Resource groups use a multi-level policy system:
- Topology Default - Base policy for all hardware (e.g., Node-Med)
- Resource Group Default - Workload-specific override (e.g., Node-High for ML training)
- Entity-Specific - Granular control per hardware component (e.g., GPU-Optimized for specific nodes)
- Per-GPU Policies - Individual GPU power limits within a node (e.g., 500W for GPU0, 700W for GPU3)
Each level can override the previous one, allowing precise power management from datacenter-wide defaults down to individual GPUs.
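The override order can be expressed as a small resolution function: the most specific configured level wins. This is a minimal sketch of the hierarchy described above; the function and parameter names are assumptions, not a real dpsctl data model.

```python
# Sketch of the four-level policy hierarchy: per-GPU > entity-specific >
# resource-group default > topology default. Names are illustrative.

def resolve_policy(topology_default, group_default=None,
                   entity_policy=None, gpu_policy=None):
    """Return the effective policy; the most specific non-None level wins."""
    for policy in (gpu_policy, entity_policy, group_default, topology_default):
        if policy is not None:
            return policy


# Datacenter-wide default only:
resolve_policy("Node-Med")                             # -> "Node-Med"
# Workload-specific override for an ML training group:
resolve_policy("Node-Med", group_default="Node-High")  # -> "Node-High"
# Per-GPU limit overrides everything else:
resolve_policy("Node-Med", group_default="Node-High",
               entity_policy="GPU-Optimized", gpu_policy="500W")  # -> "500W"
```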
Per-GPU Power Policies
In addition to node-level policies, resource groups support per-GPU power limits. This allows workloads to allocate a different power budget to each individual GPU based on that GPU's workload characteristics.
Per-GPU policies can be configured using dpsctl resource-group update --entity-gpu-policy or the standalone dpsctl gpu-policy command. See Resource Group Update and GPU Policy for details.
Further Reading
- Power Policies - Define resource group power configurations
- Entities - Hardware resources managed by resource groups
- Topologies - Infrastructure that provides baseline policies
- User Accounts - Authentication for automation integration
- Resource Group Management - Manual resource group operations