Power Policies - Creation, Application, and Enforcement

DPS Power Policies Guide

Overview

Power policies are the core mechanism for controlling power consumption in the Domain Power Service (DPS). They define specific power limits and configurations for different hardware components (nodes, GPUs, CPUs, memory) and serve as the primary tool for optimizing power usage across datacenter equipment while maintaining performance and reliability.

This guide covers policy creation, application strategies, enforcement mechanisms, and practical examples.

Policy Structure and Components

Basic Policy Structure

A DPS power policy consists of:

  • Name: Unique identifier for the policy (e.g., “Node-High”, “GPU-Efficiency”)
  • Limits: Power constraints for different component types

Power Limit Types

Power limits can be specified in two formats:

1. Absolute Watts

Fixed power limits in watts (total wattage for all devices of the given type):

{
  "ElementType": "GPU",
  "PowerLimit": {"Watts": 700.0}
}

Note: Absolute watt values represent the total power limit for all devices of that element type on the node, not per individual device. For example, 700W for GPU means 700W total across all GPUs on the node.

2. Percentage

Relative to device maximum capabilities:

{
  "ElementType": "GPU",
  "PowerLimit": {"Percentage": 80.0}
}
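To make the two formats concrete, here is a minimal Python sketch that turns either form of PowerLimit into a wattage. The function name resolve_power_limit and the device_max_watts parameter are hypothetical, and the 700 W maximum is chosen only for illustration. This guide does not describe how DPS performs the conversion internally, so read this as arithmetic under those assumptions.

```python
def resolve_power_limit(power_limit: dict, device_max_watts: float) -> float:
    """Return the effective limit in watts for a PowerLimit object.

    device_max_watts is an assumed input: the maximum the limit is
    measured against when the policy uses a percentage.
    """
    if "Watts" in power_limit:
        return float(power_limit["Watts"])
    if "Percentage" in power_limit:
        return device_max_watts * float(power_limit["Percentage"]) / 100.0
    raise ValueError("PowerLimit needs either 'Watts' or 'Percentage'")


# Hypothetical 700 W maximum, for illustration only.
print(resolve_power_limit({"Percentage": 80.0}, device_max_watts=700.0))  # 560.0
print(resolve_power_limit({"Watts": 700.0}, device_max_watts=700.0))      # 700.0
```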

Element Types

Currently, DPS supports power limits for the following component types:

  • Node: Overall node-level power constraints
  • GPU: Graphics processing unit power limits
  • CPU: Central processing unit power limits
  • Memory: Memory subsystem power limits

Important Notes:

  • One Definition Per Type: Each element type can only have one power limit definition per policy
  • Optional Components: Not all element types need to be defined - unused types are simply ignored
  • Device-Specific Support: Depending on the hardware platform, some element types may not be supported and will be safely ignored
  • Generic Policies: This flexibility, combined with percentage-based limits, enables creating generic policy definitions that can apply to any device type, regardless of specific hardware capabilities
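The rules above can be checked before a policy is handed to DPS. The following is a hedged sketch, not part of dpsctl: validate_policy is a hypothetical helper, and the requirement that each PowerLimit carries exactly one of Watts or Percentage is an assumption inferred from the two formats described earlier.

```python
SUPPORTED_TYPES = {"Node", "GPU", "CPU", "Memory"}


def validate_policy(policy: dict) -> list:
    """Return a list of human-readable problems found in one policy."""
    problems = []
    if not policy.get("Name"):
        problems.append("policy has no Name")
    seen = set()
    for limit in policy.get("Limits", []):
        element = limit.get("ElementType")
        if element not in SUPPORTED_TYPES:
            problems.append(f"unknown ElementType: {element!r}")
        if element in seen:
            # One definition per element type per policy.
            problems.append(f"duplicate ElementType: {element!r}")
        seen.add(element)
        power = limit.get("PowerLimit", {})
        # Assumption: exactly one of the two formats is given.
        if len(set(power) & {"Watts", "Percentage"}) != 1:
            problems.append(f"{element!r}: expected exactly one of Watts or Percentage")
    return problems


policy = {
    "Name": "GPU-Efficiency",
    "Limits": [{"ElementType": "GPU", "PowerLimit": {"Percentage": 80.0}}],
}
print(validate_policy(policy))  # []
```

A policy that defines only some element types passes this check, matching the note that unused types are ignored.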

Policy Creation

Creating Policies via Topology Import

Policies are defined as part of the topology import file and loaded into DPS when the topology is imported:

{
  ...
  "Policies": [
    {
      "Name": "ML-Training-High",
      "Limits": [
        {"ElementType": "Node", "PowerLimit": {"Watts": 10200}},
        {"ElementType": "GPU", "PowerLimit": {"Watts": 7650}},
        {"ElementType": "CPU", "PowerLimit": {"Watts": 1530}},
        {"ElementType": "Memory", "PowerLimit": {"Watts": 1020}}
      ]
    },
    {
      "Name": "Inference-Balanced",
      "Limits": [
        {"ElementType": "Node", "PowerLimit": {"Percentage": 70}},
        {"ElementType": "GPU", "PowerLimit": {"Percentage": 65}},
        {"ElementType": "CPU", "PowerLimit": {"Percentage": 80}},
        {"ElementType": "Memory", "PowerLimit": {"Percentage": 60}}
      ]
    }
  ]
}

The entire configuration, including policies, entities, and topology definitions, is imported together using:

dpsctl topology import datacenter-config.json

Policy Creation Best Practices

  1. Meaningful Names: Use descriptive names that indicate the policy’s purpose

    • Good: ML-Training-High, Inference-Efficient, Maintenance-Low
    • Avoid: Policy1, Test, Default
  2. Consistent Scaling: Ensure power limits scale appropriately across components

    • GPU limits should reflect the workload’s compute requirements
    • CPU limits should account for coordination overhead
    • Memory limits should match data processing needs
  3. Hardware Compatibility: Consider different hardware platforms

    • Use percentage-based limits for portability
    • Validate against actual device specifications
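One way to sanity-check the "consistent scaling" guideline for absolute-watt policies is to compare the node limit with the sum of the component limits. The sketch below uses a hypothetical helper, component_headroom. Whether DPS requires or enforces this relationship is not stated in this guide, so treat it as an optional review aid.

```python
from typing import Optional


def component_headroom(policy: dict) -> Optional[float]:
    """Node watts minus the sum of GPU/CPU/Memory watts.

    Returns None when the comparison is not possible (any limit is
    percentage-based, or no Node limit is defined).
    """
    watts = {}
    for limit in policy["Limits"]:
        power = limit["PowerLimit"]
        if "Watts" not in power:
            return None
        watts[limit["ElementType"]] = float(power["Watts"])
    if "Node" not in watts:
        return None
    components = sum(v for k, v in watts.items() if k != "Node")
    return watts["Node"] - components


ml_training_high = {
    "Name": "ML-Training-High",
    "Limits": [
        {"ElementType": "Node", "PowerLimit": {"Watts": 10200}},
        {"ElementType": "GPU", "PowerLimit": {"Watts": 7650}},
        {"ElementType": "CPU", "PowerLimit": {"Watts": 1530}},
        {"ElementType": "Memory", "PowerLimit": {"Watts": 1020}},
    ],
}
print(component_headroom(ml_training_high))  # 0.0
```

For the ML-Training-High example from this guide, the component limits add up exactly to the node limit. A negative result would mean the components could together exceed the node limit.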

Policy Assignment and Device Idle Policies

Device Specification Idle Policies

Device specifications can define idle policies as a protective mechanism for devices that don’t have explicit policies assigned. These are defined in the device YAML specification files using the idlePolicy node:

- type: ComputerSystem
# ...
    idlePolicy:
      Node:
        watts: 4000
      GPU:
        watts: 2000
      CPU:
        watts: 1500
      Memory:
        watts: 500

Device Idle Policy Characteristics:

  • Protection Mechanism: Ensures devices always have some form of power constraint
  • Component Types: Defined for the same component types as standalone policies (Node, GPU, CPU, Memory)
  • Absolute Watts Only: Only supports absolute watt values, not percentage-based limits
  • Device-Specific: Defined per device model in YAML specifications
  • Fallback Role: Used automatically when no explicit policy is assigned to a topology entity

Policy Assignment Flow

Understanding how policies are assigned and activated is crucial for effective power management:

  1. DPS Service Deployment: Initially, the DPS service is unaware of any topology configuration
  2. Topology Import: Administrator imports topology configuration using dpsctl topology import
    • Topology defines devices (entities) and available policies
    • Each topology entity can optionally have a policy assigned
  3. Topology Activation: Topology becomes active and default policy assignments take effect
  4. Policy Resolution: For each entity, DPS determines which policy to apply:
    • Explicit Policy: If entity has a policy assigned in topology → use that policy
    • Idle Policy: If entity has no policy assigned → use device specification idle policy
    • No Policy: If no idle policy exists → device operates without power constraints and is treated as consuming maximum power (not recommended)
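The resolution order in step 4 can be sketched in a few lines of Python. This illustrates the documented precedence only and is not DPS code: resolve_policy is a hypothetical function, and the caller is assumed to have already looked up the idle policy (if any) from the entity's device specification.

```python
from typing import Optional, Tuple


def resolve_policy(
    entity: dict,
    policies: dict,
    idle_policy: Optional[dict],
) -> Tuple[str, Optional[dict]]:
    """Return (source, policy) following explicit -> idle -> none."""
    name = entity.get("Policy")
    if name:
        # Explicit assignment in the topology wins.
        # A KeyError here means the topology references an undefined policy.
        return ("explicit", policies[name])
    if idle_policy is not None:
        # Fallback defined in the device specification.
        return ("idle", idle_policy)
    # No constraint at all: the case the guide advises against.
    return ("none", None)


policies = {"Node-Med": {"Name": "Node-Med", "Limits": []}}
idle = {"Node": {"watts": 4000}}

print(resolve_policy({"Name": "node001", "Policy": "Node-Med"}, policies, idle)[0])
print(resolve_policy({"Name": "node002"}, policies, idle)[0])
print(resolve_policy({"Name": "node003"}, policies, None)[0])
```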

Policy Application Methods

1. Topology-Level Default Policies

Assign a default policy to an entity in the topology definition by setting its Policy field:

{
  "Type": "ComputerSystem",
  "Name": "node001",
  "Policy": "Node-Med"
}

2. Resource Group Policies

Override default policies for specific workloads:

# Create resource group with policy
dpsctl resource-group create \
  --resource-group "ml-training-job" \
  --external-id "slurm-12345" \
  --policy "ML-Training-High"

# Add nodes to the resource group
dpsctl resource-group add \
  --resource-group "ml-training-job" \
  --entities "node001,node002,node003"

# Activate the resource group
dpsctl resource-group activate \
  --resource-group "ml-training-job"

3. Resource Group Entity-Specific Policy Overrides

Apply granular policies to individual entities:

# Update specific entity policy within resource group
dpsctl resource-group update \
  --resource-group "ml-training-job" \
  --entity-policy "GPU-Optimized=node001" \
  --entity-policy "CPU-Intensive=node003"

4. Dynamic GPU-Level Policies

Set per-GPU power limits for fine-grained control:

# Set individual GPU limits (8 GPUs on node001)
dpsctl gpu-policy \
  --node "node001=500,550,600,700,650,700,550,600"
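When per-GPU limits come from a scheduler or script, the --node value can be assembled programmatically. The sketch below builds the argument list for the command shown above. gpu_policy_argv is a hypothetical helper, it uses only the gpu-policy subcommand and --node flag that appear in this guide, and the check for positive values is an assumption, not a documented dpsctl rule.

```python
def gpu_policy_argv(node: str, gpu_limits: list) -> list:
    """Build the argv for `dpsctl gpu-policy --node "<node>=<l0>,<l1>,..."`."""
    if not gpu_limits:
        raise ValueError("at least one GPU limit is required")
    if any(limit <= 0 for limit in gpu_limits):
        # Assumed sanity check: limits should be positive numbers.
        raise ValueError("GPU limits must be positive")
    value = f"{node}=" + ",".join(str(limit) for limit in gpu_limits)
    return ["dpsctl", "gpu-policy", "--node", value]


argv = gpu_policy_argv("node001", [500, 550, 600, 700, 650, 700, 550, 600])
print(" ".join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) on a host where dpsctl is installed and configured.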

Policy Management Commands

Listing Policies

# List all available policies
dpsctl policy list

# Example output:
[
  {
    "Name": "Node-High",
    "Limits": [
      {"ElementType": "Node", "PowerLimit": {"Watts": 10200}},
      {"ElementType": "GPU", "PowerLimit": {"Watts": 7650}},
      {"ElementType": "CPU", "PowerLimit": {"Watts": 1530}},
      {"ElementType": "Memory", "PowerLimit": {"Watts": 1020}}
    ]
  }
]
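For scripting, the listing can be indexed by policy name. This minimal sketch assumes that the command's output is exactly a JSON array shaped like the example above. policies_by_name is a hypothetical helper, and the raw string stands in for captured command output.

```python
import json


def policies_by_name(raw_output: str) -> dict:
    """Parse a JSON array of policies and index it by the Name field."""
    return {policy["Name"]: policy for policy in json.loads(raw_output)}


# Stand-in for the captured output of `dpsctl policy list`.
raw = """
[
  {
    "Name": "Node-High",
    "Limits": [
      {"ElementType": "Node", "PowerLimit": {"Watts": 10200}},
      {"ElementType": "GPU", "PowerLimit": {"Watts": 7650}}
    ]
  }
]
"""
index = policies_by_name(raw)
print(sorted(index))  # ['Node-High']
```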