Power Policies

Overview

Power policies define specific power limits and configurations that can be applied to topology entities. They specify maximum power consumption for different component types (nodes, GPUs, CPUs, memory) and serve as the primary mechanism for controlling power usage across datacenter equipment.

Policy Structure

Each power policy includes:

  • Name - Unique identifier (e.g., “Node-High”, “GPU-Efficiency”)
  • Limits - Power constraints for different component types

Power limits can be specified as:

  • Absolute Watts - Fixed power limits (e.g., 700W per GPU)
  • Percentage - Relative to device capabilities (e.g., 80% of max power)

Both formats are supported and can be mixed within the same policy.

Policy Examples

High Performance Policy

{
  "Name": "Node-High",
  "Limits": [
    {"ElementType": "Node", "PowerLimit": {"Watts": 10200}},
    {"ElementType": "GPU", "PowerLimit": {"Watts": 7650}},
    {"ElementType": "CPU", "PowerLimit": {"Watts": 1530}},
    {"ElementType": "Memory", "PowerLimit": {"Watts": 1020}}
  ]
}

Note: Different hardware platforms may use different power limits.

Balanced Policy

{
  "Name": "Node-Med",
  "Limits": [
    {"ElementType": "Node", "PowerLimit": {"Watts": 7140}},
    {"ElementType": "GPU", "PowerLimit": {"Watts": 5355}},
    {"ElementType": "CPU", "PowerLimit": {"Watts": 1071}},
    {"ElementType": "Memory", "PowerLimit": {"Watts": 714}}
  ]
}

Power Saving Policy

{
  "Name": "Node-Low",
  "Limits": [
    {"ElementType": "Node", "PowerLimit": {"Watts": 5250}},
    {"ElementType": "GPU", "PowerLimit": {"Watts": 2800}},
    {"ElementType": "CPU", "PowerLimit": {"Watts": 720}},
    {"ElementType": "Memory", "PowerLimit": {"Watts": 480}}
  ]
}

Percentage-Based Policy

Percentage-based policies allow power limits to be specified as a percentage of the device’s maximum power capacity, making them more portable across different hardware configurations:

Note: When using percentage limits, DPS calculates the actual watt values from the device specifications. For example, a GPU with maxLoadWatts: 700 and an 80% limit results in a 560W power limit.

{
  "Name": "Node-Efficiency",
  "Limits": [
    {"ElementType": "Node", "PowerLimit": {"Percentage": 70}},
    {"ElementType": "GPU", "PowerLimit": {"Percentage": 80}},
    {"ElementType": "CPU", "PowerLimit": {"Percentage": 75}},
    {"ElementType": "Memory", "PowerLimit": {"Percentage": 60}}
  ]
}
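
The percentage arithmetic described in the note above can be sketched in a few lines of Python. This is an illustration only, not DPS code: the function name resolve_limit_watts and its max_load_watts parameter are hypothetical, and the sketch assumes a PowerLimit object carries either a Watts or a Percentage key, as in the examples on this page.

```python
def resolve_limit_watts(power_limit: dict, max_load_watts: float) -> float:
    """Return the effective limit in watts for one PowerLimit object.

    Hypothetical helper: assumes the device's maximum load in watts
    is known (maxLoadWatts in the device specification).
    """
    if "Watts" in power_limit:
        # Absolute limits are used as-is.
        return float(power_limit["Watts"])
    if "Percentage" in power_limit:
        # Percentage limits are scaled against the device maximum.
        return max_load_watts * power_limit["Percentage"] / 100.0
    raise ValueError("PowerLimit needs either 'Watts' or 'Percentage'")


# The GPU example from the note: 80% of 700W.
print(resolve_limit_watts({"Percentage": 80}, max_load_watts=700))  # 560.0
```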

Mixed Policy (Absolute and Percentage)

Policies can combine both absolute watt values and percentage limits for different component types:

{
  "Name": "Node-Hybrid",
  "Limits": [
    {"ElementType": "Node", "PowerLimit": {"Percentage": 85}},
    {"ElementType": "GPU", "PowerLimit": {"Watts": 600}},
    {"ElementType": "CPU", "PowerLimit": {"Percentage": 90}},
    {"ElementType": "Memory", "PowerLimit": {"Watts": 800}}
  ]
}
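
Before importing a mixed policy, a quick structural check can catch typos. The Python sketch below is not part of DPS. It assumes that each limit must contain exactly one of Watts or Percentage and that the only element types are the four used on this page; validate_policy and ALLOWED_TYPES are hypothetical names.

```python
from typing import List

# Assumption: only the element types shown on this page are valid.
ALLOWED_TYPES = {"Node", "GPU", "CPU", "Memory"}


def validate_policy(policy: dict) -> List[str]:
    """Return a list of problems found in a policy dict (empty if none)."""
    errors = []
    if not policy.get("Name"):
        errors.append("missing Name")
    for i, limit in enumerate(policy.get("Limits", [])):
        etype = limit.get("ElementType")
        if etype not in ALLOWED_TYPES:
            errors.append(f"Limits[{i}]: unknown ElementType {etype!r}")
        # Assumption: a limit is either absolute or relative, never both.
        keys = set(limit.get("PowerLimit", {})) & {"Watts", "Percentage"}
        if len(keys) != 1:
            errors.append(f"Limits[{i}]: expected exactly one of Watts or Percentage")
    return errors


hybrid = {
    "Name": "Node-Hybrid",
    "Limits": [
        {"ElementType": "Node", "PowerLimit": {"Percentage": 85}},
        {"ElementType": "GPU", "PowerLimit": {"Watts": 600}},
    ],
}
print(validate_policy(hybrid))  # []
```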

Policy Application Hierarchy

Power policies are applied in a three-level hierarchy:

1. Topology Default Policies (Base Level)

{
  "Type": "ComputerSystem",
  "Name": "node001",
  "Policy": "Node-Med"
}

2. Resource Group Policies (Override Level)

dpsctl resource-group create \
  --resource-group "ml-training" \
  --policy "Node-High"

3. Entity-Specific Policies (Granular Level)

dpsctl resource-group update \
  --resource-group "ml-training" \
  --entity node001 \
  --policy "GPU-Optimized"
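
The level names (base, override, granular) suggest that the most specific policy wins. The Python sketch below models that reading. It is an assumption about precedence rather than a description of DPS internals, and effective_policy is a hypothetical helper.

```python
from typing import Optional


def effective_policy(topology_default: str,
                     resource_group_policy: Optional[str] = None,
                     entity_policy: Optional[str] = None) -> str:
    """Pick the policy name that applies to one entity.

    Assumed precedence: entity-specific > resource group > topology default.
    """
    if entity_policy is not None:
        return entity_policy
    if resource_group_policy is not None:
        return resource_group_policy
    return topology_default


# node001 outside any resource group: the topology default applies.
print(effective_policy("Node-Med"))                                # Node-Med
# node001 inside "ml-training": the group policy overrides the default.
print(effective_policy("Node-Med", "Node-High"))                   # Node-High
# node001 with an entity-specific policy inside the group.
print(effective_policy("Node-Med", "Node-High", "GPU-Optimized"))  # GPU-Optimized
```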

Complete Policy Set Example

{
  "Policies": [
    {
      "Name": "Node-Low",
      "Limits": [
        {"ElementType": "Node", "PowerLimit": {"Watts": 5250}},
        {"ElementType": "GPU", "PowerLimit": {"Watts": 2800}},
        {"ElementType": "CPU", "PowerLimit": {"Watts": 720}},
        {"ElementType": "Memory", "PowerLimit": {"Watts": 480}}
      ]
    },
    {
      "Name": "Node-Med",
      "Limits": [
        {"ElementType": "Node", "PowerLimit": {"Watts": 7140}},
        {"ElementType": "GPU", "PowerLimit": {"Watts": 5355}},
        {"ElementType": "CPU", "PowerLimit": {"Watts": 1071}},
        {"ElementType": "Memory", "PowerLimit": {"Watts": 714}}
      ]
    },
    {
      "Name": "Node-High",
      "Limits": [
        {"ElementType": "Node", "PowerLimit": {"Watts": 10200}},
        {"ElementType": "GPU", "PowerLimit": {"Watts": 7650}},
        {"ElementType": "CPU", "PowerLimit": {"Watts": 1530}},
        {"ElementType": "Memory", "PowerLimit": {"Watts": 1020}}
      ]
    }
  ]
}

Usage

Import Policies

# Import policies from topology file
dpsctl topology import datacenter.json

# List available policies
dpsctl policy list

Apply to Entities

# Set entity default policy
dpsctl entity update node001 --policy "Node-High"

Use in Resource Groups

# Create resource group with policy
dpsctl resource-group create \
  --resource-group "ml-training" \
  --policy "Node-High"

# Update resource group policy
dpsctl resource-group update \
  --resource-group "ml-training" \
  --policy "Node-Med"

GPU-Level Control

# Set per-GPU power limits for the node in the resource group
dpsctl gpu-policy \
  --node node001=500,550,600,700,650,700,550,600
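
When generating the --node argument from a script, it helps to check its shape first. The Python sketch below parses the NODE=V1,V2,... form shown above. It assumes the values are integer limits listed in GPU index order, which this page does not state explicitly; parse_node_gpu_limits is a hypothetical helper, not a DPS function.

```python
from typing import List, Tuple


def parse_node_gpu_limits(arg: str) -> Tuple[str, List[int]]:
    """Split 'node001=500,550,...' into a node name and per-GPU limits."""
    node, sep, values = arg.partition("=")
    if not sep or not node or not values:
        raise ValueError(f"expected NODE=V1,V2,..., got {arg!r}")
    # int() raises ValueError on empty or non-numeric entries.
    return node, [int(v) for v in values.split(",")]


node, limits = parse_node_gpu_limits("node001=500,550,600,700,650,700,550,600")
print(node, len(limits), max(limits))  # node001 8 700
```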

Policy Implementation

Power policies are implemented through device-specific plugins that:

  • Translate policy limits into hardware commands
  • Communicate with BMCs through Redfish APIs
  • Monitor compliance and report status

# Device specification references policy plugin
- type: ComputerSystem
  model: DGX_H100
  spec:
    powerPolicyPlugin: DGX_H100

For the standard devices bundled with DPS, specifying powerPolicyPlugin is optional. Where applicable, DPS internally stores a default power policy plugin for each model name.

Further Reading