Power Policies - Creation, Application, and Enforcement
DPS Power Policies Guide
Overview
Power policies are the core mechanism for controlling power consumption in the Domain Power Service (DPS). They define specific power limits and configurations for different hardware components (nodes, GPUs, CPUs, memory) and serve as the primary tool for optimizing power usage across datacenter equipment while maintaining performance and reliability.
This guide covers policy creation, application strategies, enforcement mechanisms, and practical examples.
Policy Structure and Components
Basic Policy Structure
A DPS power policy consists of:
- Name: Unique identifier for the policy (e.g., “Node-High”, “GPU-Efficiency”)
- Limits: Power constraints for different component types
Power Limit Types
Power limits can be specified in two formats:
1. Absolute Watts
Fixed power limits in watts (total wattage for all devices of the given type):
{
  "ElementType": "GPU",
  "PowerLimit": {"Watts": 700.0}
}
Note: Absolute watt values represent the total power limit for all devices of that element type on the node, not per individual device. For example, 700W for GPU means 700W total across all GPUs on the node.
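To make the node-total semantics concrete, the sketch below divides an absolute limit across the devices of one type. The even split and the `per_device_share` helper are illustrative assumptions; this guide does not describe how DPS distributes a total among individual devices.

```python
def per_device_share(total_watts: float, device_count: int) -> float:
    """Return each device's share of a node-wide limit.

    Assumes an even split, which is a simplification for illustration only.
    """
    if device_count <= 0:
        raise ValueError("device_count must be positive")
    return total_watts / device_count


# 700 W is the total for all GPUs on the node, not a per-GPU value.
print(per_device_share(700.0, 8))  # 87.5
```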
2. Percentage
Relative to device maximum capabilities:
{
  "ElementType": "GPU",
  "PowerLimit": {"Percentage": 80.0}
}
Element Types
Currently, DPS supports power limits for the following component types:
- Node: Overall node-level power constraints
- GPU: Graphics processing unit power limits
- CPU: Central processing unit power limits
- Memory: Memory subsystem power limits
Important Notes:
- One Definition Per Type: Each element type can only have one power limit definition per policy
- Optional Components: Not all element types need to be defined - unused types are simply ignored
- Device-Specific Support: Depending on the hardware platform, some element types may not be supported and will be safely ignored
- Generic Policies: This flexibility, combined with percentage-based limits, enables creating generic policy definitions that can apply to any device type, regardless of specific hardware capabilities
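The rules above can be sketched as a small pre-import check. This is a minimal illustration, not part of DPS: `validate_policy`, `resolve_watts`, and the `device_max_watts` argument are hypothetical names, and the percentage arithmetic assumes a limit is a simple fraction of the device maximum.

```python
SUPPORTED_TYPES = {"Node", "GPU", "CPU", "Memory"}


def validate_policy(policy: dict) -> list:
    """Return a list of problems found in a policy dict (empty if none)."""
    problems = []
    seen = set()
    for limit in policy.get("Limits", []):
        element_type = limit.get("ElementType")
        if element_type not in SUPPORTED_TYPES:
            problems.append(f"unknown element type: {element_type}")
        elif element_type in seen:
            # Each element type may appear only once per policy.
            problems.append(f"duplicate element type: {element_type}")
        seen.add(element_type)
        power = limit.get("PowerLimit", {})
        # A limit is given either in Watts or as a Percentage, not both.
        if set(power) not in ({"Watts"}, {"Percentage"}):
            problems.append(f"{element_type}: expected exactly one of Watts or Percentage")
    return problems


def resolve_watts(power_limit: dict, device_max_watts: float) -> float:
    """Convert a limit to watts; percentages are taken relative to the device maximum."""
    if "Watts" in power_limit:
        return float(power_limit["Watts"])
    return device_max_watts * power_limit["Percentage"] / 100.0


policy = {
    "Name": "GPU-Efficiency",
    "Limits": [
        {"ElementType": "GPU", "PowerLimit": {"Percentage": 80.0}},
        {"ElementType": "GPU", "PowerLimit": {"Watts": 700.0}},
    ],
}
print(validate_policy(policy))                        # ['duplicate element type: GPU']
print(resolve_watts({"Percentage": 80.0}, 1000.0))    # 800.0
```

Because percentages are resolved against each device's own maximum, the same policy definition yields different watt values on different hardware, which is what makes percentage-based policies portable.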
Policy Creation
Creating Policies via Topology Import
Policies are defined as part of the topology import file and loaded into DPS when the topology is imported:
{
  ...
  "Policies": [
    {
      "Name": "ML-Training-High",
      "Limits": [
        {"ElementType": "Node", "PowerLimit": {"Watts": 10200}},
        {"ElementType": "GPU", "PowerLimit": {"Watts": 7650}},
        {"ElementType": "CPU", "PowerLimit": {"Watts": 1530}},
        {"ElementType": "Memory", "PowerLimit": {"Watts": 1020}}
      ]
    },
    {
      "Name": "Inference-Balanced",
      "Limits": [
        {"ElementType": "Node", "PowerLimit": {"Percentage": 70}},
        {"ElementType": "GPU", "PowerLimit": {"Percentage": 65}},
        {"ElementType": "CPU", "PowerLimit": {"Percentage": 80}},
        {"ElementType": "Memory", "PowerLimit": {"Percentage": 60}}
      ]
    }
  ]
}
The entire configuration, including policies, entities, and topology definitions, is imported together using:
dpsctl topology import datacenter-config.json
Policy Creation Best Practices
- Meaningful Names: Use descriptive names that indicate the policy’s purpose
  - ✅ ML-Training-High, Inference-Efficient, Maintenance-Low
  - ❌ Policy1, Test, Default
- Consistent Scaling: Ensure power limits scale appropriately across components
  - GPU limits should reflect the workload’s compute requirements
  - CPU limits should account for coordination overhead
  - Memory limits should match data processing needs
- Hardware Compatibility: Consider different hardware platforms
  - Use percentage-based limits for portability
  - Validate against actual device specifications
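One way to check consistent scaling before import is to compare the component limits against the node limit. The sketch below assumes a rule that GPU, CPU, and Memory watts together should not exceed the Node watts; that rule and the `components_fit_node` helper are illustrative assumptions, not documented DPS behavior.

```python
def components_fit_node(policy: dict) -> bool:
    """True if the absolute component limits fit within the Node limit (assumed rule)."""
    watts = {
        limit["ElementType"]: limit["PowerLimit"]["Watts"]
        for limit in policy["Limits"]
        if "Watts" in limit["PowerLimit"]
    }
    if "Node" not in watts:
        return True  # no node-level watt limit to compare against
    component_total = sum(value for name, value in watts.items() if name != "Node")
    return component_total <= watts["Node"]


ml_training_high = {
    "Name": "ML-Training-High",
    "Limits": [
        {"ElementType": "Node", "PowerLimit": {"Watts": 10200}},
        {"ElementType": "GPU", "PowerLimit": {"Watts": 7650}},
        {"ElementType": "CPU", "PowerLimit": {"Watts": 1530}},
        {"ElementType": "Memory", "PowerLimit": {"Watts": 1020}},
    ],
}
print(components_fit_node(ml_training_high))  # True: 7650 + 1530 + 1020 == 10200
```

In the ML-Training-High example the component limits add up exactly to the node limit, so the check passes at the boundary.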
Policy Assignment and Device Idle Policies
Device Specification Idle Policies
Device specifications can define idle policies as a protective mechanism for devices that don’t have explicit policies assigned. These are defined in the device YAML specification files using the idlePolicy node:
- type: ComputerSystem
  # ...
  idlePolicy:
    Node:
      watts: 4000
    GPU:
      watts: 2000
    CPU:
      watts: 1500
    Memory:
      watts: 500
Device Idle Policy Characteristics:
- Protection Mechanism: Ensures devices always have some form of power constraint
- Component Types: Defined for the same component types as standalone policies (Node, GPU, CPU, Memory)
- Absolute Watts Only: Only supports absolute watt values, not percentage-based limits
- Device-Specific: Defined per device model in YAML specifications
- Fallback Role: Used automatically when no explicit policy is assigned to a topology entity
Policy Assignment Flow
Understanding how policies are assigned and activated is crucial for effective power management:
- DPS Service Deployment: Initially, the DPS service is unaware of any topology configuration
- Topology Import: Administrator imports topology configuration using dpsctl topology import
  - Topology defines devices (entities) and available policies
  - Each topology entity can optionally have a policy assigned
- Topology Activation: Topology becomes active and default policy assignments take effect
- Policy Resolution: For each entity, DPS determines which policy to apply:
  - Explicit Policy: If entity has a policy assigned in topology → use that policy
  - Idle Policy: If entity has no policy assigned → use device specification idle policy
  - No Policy: If no idle policy exists → device operates without power constraints (and is treated as consuming maximum power; not recommended)
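The resolution order above can be sketched as a short function. This is a simplified model, not DPS code: `resolve_policy` and the dictionary shapes for entities and idle policies are hypothetical.

```python
from typing import Optional, Tuple


def resolve_policy(entity: dict, idle_policy: Optional[dict]) -> Tuple[str, object]:
    """Return (source, policy) following the order explicit -> idle -> none."""
    if entity.get("Policy"):
        return ("explicit", entity["Policy"])
    if idle_policy:
        return ("idle", idle_policy)
    # No constraint at all: the device is treated as consuming maximum power.
    return ("none", None)


# Idle policy values in watts, mirroring the device specification example.
idle = {"Node": 4000, "GPU": 2000, "CPU": 1500, "Memory": 500}

print(resolve_policy({"Name": "node001", "Policy": "Node-Med"}, idle))  # ('explicit', 'Node-Med')
print(resolve_policy({"Name": "node002"}, idle)[0])                     # idle
print(resolve_policy({"Name": "node003"}, None))                        # ('none', None)
```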
Policy Application Methods
1. Topology-Level Default Policies
Set a default policy on each entity in the topology definition:
{
  "Type": "ComputerSystem",
  "Name": "node001",
  "Policy": "Node-Med"
}
2. Resource Group Policies
Override default policies for specific workloads:
# Create resource group with policy
dpsctl resource-group create \
  --resource-group "ml-training-job" \
  --external-id "slurm-12345" \
  --policy "ML-Training-High"
# Add nodes to the resource group
dpsctl resource-group add \
  --resource-group "ml-training-job" \
  --entities "node001,node002,node003"
# Activate the resource group
dpsctl resource-group activate \
  --resource-group "ml-training-job"
3. Resource Group Entity-Specific Policy Overrides
Apply granular policies to individual entities:
# Update specific entity policy within resource group
dpsctl resource-group update \
  --resource-group "ml-training-job" \
  --entity-policy "GPU-Optimized=node001" \
  --entity-policy "CPU-Intensive=node003"
4. Dynamic GPU-Level Policies
Set per-GPU power limits for fine-grained control:
# Set individual GPU limits (8 GPUs on node001)
dpsctl gpu-policy \
  --node "node001=500,550,600,700,650,700,550,600"
Policy Management Commands
Listing Policies
# List all available policies
dpsctl policy list
# Example output:
[
  {
    "Name": "Node-High",
    "Limits": [
      {"ElementType": "Node", "PowerLimit": {"Watts": 10200}},
      {"ElementType": "GPU", "PowerLimit": {"Watts": 7650}},
      {"ElementType": "CPU", "PowerLimit": {"Watts": 1530}},
      {"ElementType": "Memory", "PowerLimit": {"Watts": 1020}}
    ]
  }
]
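If the listing needs to be consumed by a script, output of this shape can be parsed with a standard JSON library. The sketch below assumes the command prints only the JSON array shown above; the `summarize` helper and the embedded sample string are illustrative and do not call dpsctl.

```python
import json

# Sample text copied from the example output above (not a live dpsctl call).
sample_output = """
[
  {
    "Name": "Node-High",
    "Limits": [
      {"ElementType": "Node", "PowerLimit": {"Watts": 10200}},
      {"ElementType": "GPU", "PowerLimit": {"Watts": 7650}},
      {"ElementType": "CPU", "PowerLimit": {"Watts": 1530}},
      {"ElementType": "Memory", "PowerLimit": {"Watts": 1020}}
    ]
  }
]
"""


def summarize(policies_json: str) -> dict:
    """Map each policy name to a {element type: power limit} dictionary."""
    policies = json.loads(policies_json)
    return {
        policy["Name"]: {
            limit["ElementType"]: limit["PowerLimit"] for limit in policy["Limits"]
        }
        for policy in policies
    }


summary = summarize(sample_output)
print(summary["Node-High"]["GPU"])  # {'Watts': 7650}
```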